Abstract: Query privacy in secure DBMS is an important
feature, although rarely formally considered outside the theoretical
community. Because of the high overheads of guaranteeing
privacy  in  complex  queries,  almost  all  previous  works  addressing 
practical  applications  consider  limited  queries  (e.g.,  just  keyword 
search),  or  provide  a  weak  guarantee  of  privacy. 

In  this  work,  we  address  a  major  open  problem  in  private 
DB: efficient sublinear search for arbitrary Boolean queries. We
consider  scalable  DBMS  with  provable  security  for  all  parties, 
including  protection  of  the  data  from  both  server  (who  stores 
encrypted  data)  and  client  (who  searches  it),  as  well  as  protection 
of  the  query,  and  access  control  for  the  query. 

We  design,  build,  and  evaluate  the  performance  of  a  rich 
DBMS system, suitable for real-world deployment on today's
medium- to large-scale DBs. On a modern server, we are able to
query a formula over a 10TB, 100M-record DB, with 70 searchable
index terms per DB row, in time comparable to (insecure)
MySQL  (many  practical  queries  can  be  privately  executed  with 
work  1.2-3  times  slower  than  MySQL,  although  some  queries  are 
costlier). 

We  support  a  rich  query  set,  including  searching  on  arbitrary 
boolean  formulas  on  keywords  and  ranges,  support  for  stemming, 
and  free  keyword  searches  over  text  fields. 

We  identify  and  permit  a  reasonable  and  controlled  amount  of 
leakage,  proving  that  no  further  leakage  is  possible.  In  particular, 
we  allow  leakage  of  some  search  pattern  information,  but  protect 
the  query  and  data,  provide  a  high  level  of  privacy  for  individual 
terms  in  the  executed  search  formula,  and  hide  the  difference 
between  a  query  that  returned  no  results  and  a  query  that 
returned a very small result set. We also support private and
complex access policies, integrated in the search process so that
a query with empty result set and a query that fails the policy
are hard to tell apart.


I. Introduction

Motivation.  Over  the  last  two  decades,  the  amount  of  data 
generated, collected, and stored has been steadily increasing.
This growth is now reaching dramatic proportions and
touching  every  aspect  of  our  life,  including  social,  political, 
commercial,  scientific,  medical,  and  legal  contexts.  With  the 
rise  in  size,  potential  applications  and  utility  of  these  data, 
privacy  concerns  become  more  acute.  For  example,  the  recent 
revelation  of  the  U.S.  Government’s  data  collection  programs 
reignited  the  privacy  debate. 

We  address  the  issue  of  privacy  for  database  management 
systems  (DBMS),  where  the  privacy  of  both  the  data  and 
the  query  must  be  protected.  As  an  example,  consider  the 
scenario where a law enforcement agency needs to search
airline manifests for specific persons or patterns. Because of
the  classified  nature  of  the  query  (and  even  of  the  existence  of 
a  matching  record),  the  query  cannot  be  revealed  to  the  DB. 
With  the  absence  of  truly  reliable  and  trusted  third  parties, 
today’s  solution,  supported  by  legislation,  is  to  simply  require 
the  manifests  and  any  other  permitted  data  to  be  furnished 
to  the  agency.  However,  a  solution  that  allows  the  agency  to 
ask  for  and  receive  only  the  data  it  is  interested  in  (without 
revealing  its  interest),  would  serve  two  important  goals: 

•  allay  the  negative  popular  sentiment  associated  with  large 
personal  data  collection  and  management  which  is  not 
publicly  accounted. 

•  enhance  agencies’  ability  to  mine  data,  by  obtaining 
permission  to  query  a  richer  data  set  that  could  not  be 
legally  obtained  in  its  entirety. 

In  particular,  we  implement  external  policy  enforcement  on 
queries,  thus  preventing  many  forms  of  abuse.  Our  system 
allows  an  independent  oblivious  controller  to  enforce  that 
metadata  queries  satisfy  the  specificity  requirement. 

Other  motivating  scenarios  are  abundant,  including  private 
queries  over  census  data,  information  sharing  between  law 
enforcement agencies (especially across jurisdictional and national
boundaries) and electronic discovery in lawsuits, where
parties  have  to  turn  over  relevant  documents,  but  don’t  want  to 
share  their  entire  corpus  [33],  [43].  Often  in  these  scenarios 
the  (private)  query  should  be  answered  only  if  it  satisfies  a 
certain  (secret)  policy.  A  very  recent  motivating  example  [3] 
involves  the  intended  use  of  data  from  automated  license  plate 
readers  in  order  to  solve  crimes,  and  the  concerns  over  its  use 
for  compromising  privacy  for  the  innocent. 

While  achieving  full  privacy  for  these  scenarios  is  possible 
building  on  cryptographic  tools  such  as  SPIR  [24],  FHE  [21], 
ORAM  [27]  or  multiparty  computation  (MPC),  those  solutions 
either  run  in  polynomial  time,  or  have  very  expensive  basic 
steps  in  the  sublinear  algorithms.  For  example,  when  ORAM 
is  used  to  achieve  sublinear  secure  computation  between  two 
parties  [29],  its  basic  step  involves  oblivious  PRF  evaluation. 
[29]  reports  that  it  takes  about  1000  seconds  to  run  a  binary 
search  on  entries;  subsequent  works  [22],  [39]  remain  too 
expensive  for  our  setting.  On  the  other  hand,  for  data  sets  of 
moderate  or  large  sizes,  even  linear  computation  is  prohibitive. 
This  motivates  the  following. 

Design goals. Build a secure and usable DBMS system,
with rich functionality, and performance very close to existing
insecure  implementations,  so  as  to  maintain  the  current  modus 
operandi  of  potential  users  such  as  government  agencies  and 
commercial  organizations.  At  the  same  time,  we  must  provide 
reasonable  and  provable  privacy  guarantees  for  the  system. 

These  are  the  hard  design  requirements  which  we  achieve 
with  Blind  Seer  (BLoom  filter  INDex  SEarch  of  Encrypted 
Results). Our work can be seen as an example of applying
cryptographic rigor to design and analysis of a large
system.  Privacy/efficiency  trade-offs  are  inherent  in  many 
large  systems.  We  believe  that  the  analysis  approach  we  take 
(identifying  and  permitting  a  controlled  amount  of  leakage, 
and  proving  that  there  is  no  additional  leakage)  will  be  useful 
in  future  secure  systems. 

Significance.  We  solve  a  significant  open  problem  in  private 
DB:  efficient  sublinear  search  for  arbitrary  Boolean  queries. 
While  private  keyword-search  was  achieved  in  some  models, 
this did not extend to general Boolean formulas. The natural
approach of breaking a formula into terms and keyword-searching
each term individually leaks the formula structure and the encrypted
results for each keyword, significantly compromising privacy of both
query and data. Until our work, and the (very different) independent
and  concurrent  works  [11],  [31],  it  was  not  known  how  to 
efficiently  avoid  this  leakage.  (See  Section  IX  for  extended 
discussion  on  related  work.) 

A.  Our  Setting 

Traditionally, DB querying is seen as a two-player engagement:
the client queries the server operated by the data owner,
although  delegation  of  the  server  operation  to  a  third  player  is 
increasingly  common. 

Players.  In  our  system,  there  are  three  main  players:  client  C, 
server  S,  and  index  server  IS  (there  is  another  logical  entity, 
query  checker  QC,  whose  task  of  private  query  compliance 
checking  is  technically  secondary,  albeit  practically  important. 
For generality, we consider QC as a separate player, although
its role is normally played by either S or IS). We split off
IS from S mainly for performance reasons, as two-player
private DBMS querying has trivial lower bounds that are linear
in the DB size¹, while three non-colluding players allow for far better
privacy-performance  trade-offs.  We  note  also  that  our  system 
can  be  generalized  to  handle  multiple  clients  in  several  ways 
(presenting  different  trade-offs),  but  we  focus  our  presentation 
on  the  single  client  setting. 

Allowed  leakage.  The  best  possible  privacy  for  us  would 
guarantee  that  C  learns  only  the  result  set,  and  IS  and  S 
learn  nothing  at  all.  However,  achieving  this  would  be  quite 
costly,  and  almost  certainly  far  too  expensive  as  a  replacement 
for  any  existing  DBMS.  Indeed,  practically  efficient  equality 
checking of encrypted data would likely require the use of
deterministic encryption, which allows one to identify and accumulate
access patterns. Additionally, for certain conjunctive queries,
sublinear search algorithms are currently unknown, even for
insecure DBMS. Thus, unless we opt for linear time for all
conjunctive queries, the running time already inevitably reveals
some information (see Section VI-B for more discussion).

¹ This lower bound can be circumvented if we allow precomputation, as
done for example in the ORAM based schemes mentioned above. However,
the resulting solution is far too inefficient for practice, as even its online phase
is several orders of magnitude slower than our solution.

As a result, we accept that a certain minimal amount of
leakage is unavoidable. In particular, we allow players C and IS
to learn certain search pattern information, such as the pattern
of returned results, and the traversal pattern of the encrypted
search tree. We stress that we still formally prove security of
the resulting system: our simulators of players’ views are
given the advice corresponding to the allowed leakage. We
specify  the  allowed  leakage  in  more  detail  in  Section  VI. 

We note that this work was performed under the IARPA
SPAR program [1]. Many of the privacy and functionality requirements
we address are suggested by IARPA. In Section X
we  provide  further  motivation,  examples  and  discussion  of  our 
setting  and  choices. 

B.  Our  Contributions 

We design, prove secure, implement and evaluate the first scalable
privacy-preserving DBMS which simultaneously satisfies
all  the  following  features  (see  the  following  sections  for  a  more 
complete  description  and  comparison  to  previous  works): 

•  Rich  functionality:  we  support  a  rich  set  of  queries 
including  arbitrary  Boolean  formulas,  ranges,  stemming, 
and  negations,  while  hiding  search  column  names  and 
including  free  keyword  searches  over  text  fields  in  the 
database.  We  note  that  there  is  no  standard  way  in 
MySQL  to  obtain  the  latter. 

•  Practical  scalability.  Our  performance  (similarly  to 
MySQL)  is  proportional  to  the  number  of  terms  in  the 
query and to the result set size for the CNF term with
the smallest number of results.

For a DB of size 10TB containing 100M records with 70
searchable  index  terms  per  DB  row,  our  system  executes 
many  types  of  queries  that  return  few  results  in  well  under 
a  second,  which  is  comparable  to  MySQL. 

•  Provable  security.  We  guarantee  the  privacy  of  the  data 
from  both  IS  and  C,  as  well  as  the  privacy  of  C’s  query 
from S and IS. We prove security with respect to well-defined,
reasonable, and controlled leakage. In particular,
while  certain  information  about  search  patterns  and  the 
size  of  the  result  set  is  leaked,  we  do  provide  some 
privacy  of  the  result  set  size,  suited  for  the  case  when 
identifying  that  there  is  one  result  as  opposed  to  zero 
results  is  undesirable  (Section  V-B). 

•  Natural  integration  of  private  policy  enforcement.  We 
represent  policies  as  Boolean  circuits  over  the  query,  and 
can  support  any  policy  that  depends  only  on  the  query, 
with  performance  that  depends  on  the  policy  circuit  size. 

•  Support  for  DB  updates,  deletions  and  insertions. 

To  our  knowledge  the  combination  of  performance,  features 
and  provable  security  of  our  system  has  never  been  achieved, 
even  without  implementation,  and  represents  a  breakthrough 
in private data management. Indeed, previous solutions either
require at least linear work, address a more limited type of
queries (e.g., just keyword search), or provide weaker privacy
guarantees. The independent and concurrent work of [11], [31]
(also performed under the IARPA SPAR program) is the only
system comparable to ours, in the sense that it too features a
similar combination of rich functionality, practical scalability,
provable security, and policy enforcement. However, the trade-offs
that they achieve among these requirements and their
technical approach are quite different from ours.

Figure 1. High-level overview of Blind Seer. There are three different
operations depicted: preprocessing (step 0), database searching (steps 1-4) and
data modifications (step 5).

Our scale captures moderate-to-large data, which encompasses
datasets in the motivating scenarios above (such as the
census data, on which we ran our evaluation), and represents
a major step towards privacy for truly “big data”. Our work
achieves several orders of magnitude performance improvement
as compared to the fully secure cryptographic solution,
and  much  greater  functionality  and  privacy  as  compared  to 
practical  single  keyword  search  and  heuristic  solutions. 

II.  System  Design  Overview 

Participants.  Recall,  our  system  consists  of  four  participants: 
server  S,  client  C,  index  server  IS,  and  query  checker  QC. 
The  server  owns  a  database  DB,  and  provides  its  encrypted 
searchable  copy  to  IS,  who  obliviously  services  C’s  queries. 
QC,  a  logical  player  who  can  be  co-located  with  and  may 
often  be  an  agent  of  S,  privately  enforces  a  policy  over  the 
query.  This  is  needed  to  ensure  control  over  hidden  queries 
from  C.  Player  interaction  is  depicted  in  Figure  1. 

Our  approach.  We  present  a  high-level  overview  of  our 
approach and refer the reader to Section IV for technical details.
We adhere to the following general approach of building
large  secure  systems,  in  which  full  security  is  prohibitively 
costly:  in  a  large  problem,  we  identify  small  privacy-critical 
subproblems,  and  solve  those  securely  (their  outputs  must  be 
of  low  privacy  consequence,  and  are  handled  in  plaintext). 
Then we use the outputs of the subtasks (often only a small
portion of them will need to be evaluated) to complete the
overall task efficiently.

We  solve  the  large  problem  (encrypted  search  on  large 
DB)  by  traversing  an  encrypted  search  tree.  This  allows  the 
subtasks  of  privately  computing  whether  a  tree  node  has  a  child 
matching  the  (arbitrarily  complex)  query  to  be  designated  as 
security-critical.  Further,  unlike  the  protected  input  and  the 
internals  of  this  subtask,  its  output,  obtained  in  plaintext  by 
IS,  reveals  little  private  information,  but  is  critical  in  pruning 
the  search  tree  and  achieving  efficient  sublinear  (logarithmic 
for  some  queries)  search  complexity.  Putting  it  together,  our 
search  is  performed  by  traversing  the  search  tree,  where 
each  node  decision  is  made  via  very  efficient  secure  function 
evaluation  (SFE). 

We  use  Bloom  filters  (BF)  to  store  collections  of  keywords 
in  each  tree  node.  Bloom  filters  serve  this  role  well  because 
they support small storage, constant time access, and invariance
of access patterns with respect to different queries and
match  outputs.  For  SFE,  we  use  state-of-the-art  Yao’s  garbled 
circuits. 

Because of SFE’s privacy guarantee in each tree node, the
overall  leakage  (i.e.  additional  information  learned  by  the 
players)  essentially  amounts  to  the  traversal  pattern  in  the 
encrypted  search  tree. 

We  discuss  technical  details  of  these  and  other  aspects  of 
the  system,  such  as  encrypted  search  tree  construction,  data 
representation,  policy  checking,  etc.,  in  Section  IV.  We  stress 
that  many  of  these  details  are  technically  involved. 

III.  Preliminaries 

We assume that readers are familiar with pseudorandom
generators (PRG), pseudorandom functions (PRF), and
semi-homomorphic encryption schemes with semantic security [28],
e.g., ElGamal encryption [19].

Notations. Let $[n] = \{1, \ldots, n\}$. For $\ell$-bit strings $a$ and
$b$, let $a \vee b$ (resp., $a \wedge b$ and $a \oplus b$) denote the bitwise-OR
(resp. bitwise-AND and bitwise-XOR) of $a$ and $b$. Let
$S = (i_1, i_2, \ldots, i_\eta)$ be a sequence of integers. We define a
projection of $a \in \{0,1\}^\ell$ on $S$ as $a{\downarrow}_S = a_{i_1} a_{i_2} \cdots a_{i_\eta}$; for
example, with $S = (2,4)$, we have $0101{\downarrow}_S = 11$. We also
define a filtering of $a = a_1 a_2 \ldots a_\ell$ by $S$ as $a|_S = b_1 b_2 \cdots b_\ell$
where $b_j = a_j$ if $j \in S$, or $b_j = 0$ otherwise; for example, with
$S = (2,4)$, we have $1110|_S = 0100$. We define a shrinking
function $\zeta_m$, which maps a sequence of positive integers into $[m]$, as
$\zeta_m(i_1, i_2, \ldots, i_\eta) = (j_1, j_2, \ldots, j_\eta)$,
where $j_k = ((i_k - 1) \bmod m) + 1$; for example, we have
$\zeta_3(1,3,4) = (1,3,1)$.
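The three operations above can be sketched in a few lines of Python. This is only an illustration under our own conventions: bit strings are Python strings of '0'/'1', positions are 1-indexed as in the text, and the function names `project`, `filter_bits` and `shrink` are hypothetical, not taken from the system.

```python
from typing import Sequence, Tuple


def project(a: str, S: Sequence[int]) -> str:
    """Projection of a on S: the bits of a at the (1-indexed) positions in S, in order."""
    return "".join(a[i - 1] for i in S)


def filter_bits(a: str, S: Sequence[int]) -> str:
    """Filtering of a by S: keep bit j if j is in S, replace it by 0 otherwise."""
    keep = set(S)
    return "".join(bit if j in keep else "0" for j, bit in enumerate(a, start=1))


def shrink(m: int, S: Sequence[int]) -> Tuple[int, ...]:
    """Shrinking: map every index of S into {1, ..., m}."""
    return tuple((i - 1) % m + 1 for i in S)
```

The examples from the text serve as a quick check: `project("0101", (2, 4))` gives `"11"`, `filter_bits("1110", (2, 4))` gives `"0100"`, and `shrink(3, (1, 3, 4))` gives `(1, 3, 1)`.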

Bloom filter (BF). A Bloom filter [8] is a well-known data
structure that facilitates efficient search. The filter $B$ is a string
initialized with $0^\ell$ and associated with a set of $\eta$ different
hash functions $\mathcal{H} = \{h_i : \{0,1\}^* \to [\ell]\}_{i=1}^{\eta}$. For a keyword
$\alpha \in \{0,1\}^*$, let $\mathcal{H}(\alpha)$ be the sequence of the hash results of $\alpha$,
i.e.,

$\mathcal{H}(\alpha) = (h_1(\alpha), h_2(\alpha), \ldots, h_\eta(\alpha))$.

To add a keyword $\alpha$ to the filter, the hash result $\mathcal{H}(\alpha)$ is added
to it, that is, $B := B \vee (1^\ell|_{\mathcal{H}(\alpha)})$. To see if a keyword $\beta$ is
in the filter, one needs to check if $B$ contains $\mathcal{H}(\beta)$, that is,
$B{\downarrow}_{\mathcal{H}(\beta)} \stackrel{?}{=} 1^\eta$. Bloom filters guarantee no false negatives, and
allow the false positive rate to be tuned:

$FP_{bf} = (1 - (1 - 1/\ell)^{\eta t})^{\eta} \approx (1 - e^{-\eta t/\ell})^{\eta}$,

where $t$ is the number of keywords in the Bloom filter. In
our system, we choose $\eta = 20$ and $\ell = 28.86t$ to achieve
$FP_{bf} \approx 10^{-6}$.
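A minimal Python sketch of such a filter follows, together with the standard false-positive approximation $(1 - e^{-\eta t/\ell})^{\eta}$. The hash family used here (SHA-256 over the hash index and the keyword, reduced modulo the filter length) and the 0-indexed positions are assumptions made for illustration; the text does not specify which hash functions the system uses.

```python
import hashlib
import math
from typing import List


class BloomFilter:
    def __init__(self, length: int, num_hashes: int) -> None:
        self.length = length          # filter length (number of bits)
        self.num_hashes = num_hashes  # number of hash functions
        self.bits = [0] * length      # the filter, initialized to all zeros

    def positions(self, keyword: str) -> List[int]:
        """Hash results of the keyword, one 0-indexed position per hash function.
        Assumed hash family: SHA-256 of "<index>:<keyword>" modulo the length."""
        out = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{keyword}".encode()).digest()
            out.append(int.from_bytes(digest, "big") % self.length)
        return out

    def add(self, keyword: str) -> None:
        """OR the keyword's positions into the filter."""
        for p in self.positions(keyword):
            self.bits[p] = 1

    def contains(self, keyword: str) -> bool:
        """True if every position of the keyword is set (false positives possible)."""
        return all(self.bits[p] for p in self.positions(keyword))


def false_positive_rate(length: int, num_hashes: int, num_keywords: int) -> float:
    """Standard approximation of the Bloom filter false positive rate."""
    return (1.0 - math.exp(-num_hashes * num_keywords / length)) ** num_hashes
```

With 20 hash functions and 28.86 bits per keyword (for instance a filter of length 2886 holding 100 keywords), `false_positive_rate` evaluates to roughly one in a million, in line with the parameter choice stated above.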

A.  Secure  Computation  Based  on  Yao’s  GC 

Yao’s garbled circuit (GC). Yao’s garbled circuits allow
circuits to be evaluated obliviously by one party on hidden
inputs provided by another party. Let $C$ be a Boolean circuit
with $n$ input wires, $m$ gates, and one output wire; let $(1, \ldots, n)$
be the indices to the input wires and $q = n + m + 1$ be
the index to the output wire. To generate a garbled circuit
$\tilde{C}$, a pair of random keys $w_i^0, w_i^1$ are associated with each
wire $i$ in the circuit; key $w_i^0$ corresponds to the value ‘0’ on
wire $i$, while $w_i^1$ corresponds to the value ‘1’. Then, for each
gate $g$ in the circuit, with its input wires $i, j$ and its output
wire $k$, a garbled gate $\tilde{g}$ (consisting of four ciphertexts) is
constructed so that it will enable one to recover $w_k^{g(b_i, b_j)}$ from
$w_i^{b_i}$ and $w_j^{b_j}$ (refer to [14], [36], [40], [48] for more detail).
The garbled circuit $\tilde{C}$ is simply the collection of all the garbled
gates. By recursively evaluating the garbled gates, one can
compute the garbled key $w_q^b$ given the keys $(w_1^{a_1}, \ldots, w_n^{a_n})$,
where $b = C(a_1, \ldots, a_n)$. We will sometimes call wire keys
corresponding to input/output garbled input/output, and denote
them by $\tilde{a}$ and $\tilde{b}$, i.e., $\tilde{a} = (w_1^{a_1}, \ldots, w_n^{a_n})$, $\tilde{b} = w_q^b$. We will
also use the notation of garbled evaluation $\tilde{b} = \tilde{C}(\tilde{a})$.
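To make the garbled-gate idea concrete, here is a toy Python sketch of garbling and evaluating a single two-input gate. It is a teaching illustration under stated assumptions, not the construction used in the system: each of the four ciphertexts is the output-wire key XORed with a SHA-256 pad derived from the two input keys, and a block of zero bytes is appended so the evaluator can recognize the one row that decrypts correctly (real implementations use point-and-permute instead of trial decryption). All function names are hypothetical.

```python
import hashlib
import secrets
from typing import Callable, List, Tuple

KEY_LEN = 16
TAG = b"\x00" * 16  # assumed redundancy marking a correctly decrypted row

Wire = Tuple[bytes, bytes]  # (key for value 0, key for value 1)


def new_wire() -> Wire:
    """Pick a fresh pair of random keys for one wire."""
    return secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN)


def _pad(key_i: bytes, key_j: bytes) -> bytes:
    """32-byte pad derived from the two input-wire keys."""
    return hashlib.sha256(key_i + key_j).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def garble_gate(gate: Callable[[int, int], int],
                w_i: Wire, w_j: Wire, w_k: Wire) -> List[bytes]:
    """Build the four ciphertexts of a garbled gate, in random order."""
    rows = []
    for b_i in (0, 1):
        for b_j in (0, 1):
            plaintext = w_k[gate(b_i, b_j)] + TAG
            rows.append(_xor(plaintext, _pad(w_i[b_i], w_j[b_j])))
    secrets.SystemRandom().shuffle(rows)  # hide which row belongs to which inputs
    return rows


def eval_gate(rows: List[bytes], key_i: bytes, key_j: bytes) -> bytes:
    """Recover the output-wire key from one key per input wire."""
    pad = _pad(key_i, key_j)
    for row in rows:
        plain = _xor(row, pad)
        if plain[KEY_LEN:] == TAG:
            return plain[:KEY_LEN]
    raise ValueError("no row decrypted correctly")
```

For an AND gate, evaluating with the keys for inputs (1, 1) returns the output key for value 1, and every other input combination returns the output key for value 0; the evaluator sees only keys and never learns which bit they encode.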

Oblivious  transfer.  An  oblivious  transfer  (OT)  [20],  [46] 
is  a  two-party  protocol  supporting  a  sender  that  holds  values 
$(x_0, x_1)$ and a receiver that holds an index $r \in \{0, 1\}$. The
receiver learns $x_r$, but neither the sender nor the receiver learns
anything  else,  i.e.,  the  receiver  learns  nothing  about  any  other 
values  held  by  the  sender,  and  the  sender  learns  nothing  about 
the  receiver’s  index.  We  use  the  Naor-Pinkas  protocol  [42]  as 
a  basis  and  optimize  the  performance  using  OT  extension  [30] 
and  OT  preprocessing  [5]. 

Secure  computation.  It  is  known  that  Yao’s  garbled  circuit, 
in  combination  with  any  oblivious-transfer  protocol  yields  a 
constant-round  protocol  for  secure  two-party  computation  with 
semi-honest  security  [38],  [52],  [53].  In  fact,  due  to  the  privacy 
guaranteed by Yao’s GC [7], even if the circuit $C$ is a private
input from Alice along with $x_A$, Yao’s GC can also hide
the circuit $C$ from Bob, revealing only the topology of $C$.
We  use  GCs  not  only  for  search  tree  traversal  but  also  for 
policy  enforcement.  Yao’s  GC  is  one  of  the  most  efficient 
algorithms  known  for  secure  computation  of  functions.  For 
example,  a  recent  work  [51]  demonstrated  secure  evaluation 
of  AES  (a  circuit  with  33880  gates)  in  0.2  seconds.  We  use 
the  standard  techniques  of  Free-XOR  [14],  [36]  and  point-and- 
permute  [40],  [48]  in  constructing  garbled  circuits. 


$BF_{a,b}$ contains all the keywords of records $R_a, R_{a+1}, \ldots, R_b$.
Figure 2. Index structure: Bloom-filter-based search tree.

IV.  Basic  System  Design 

In  this  section,  we  will  begin  by  describing  the  basic  system 
design supporting only simple private queries. In the next
section,  we  will  augment  this  basic  design  with  more  features. 

A.  BF  Search  Tree 

Our key data structure enabling sublinear search is a BF search
tree for the database records. We stress that there is only one
global search tree for the entire database. Let $n$ be the number
of database records and $T$ be a balanced $b$-ary tree of height
$\log_b n$ (we assume $n = b^z$ for some positive integer $z$ for
simplicity). In our system, $b$ is set to 10. In the search tree,
each leaf is associated with one database record, and each
node $v$ is associated with a Bloom filter $B_v$. The filter $B_v$
contains all the keywords from the (leaf) records that the node
$v$ has (as itself or as its descendants). For example, if a node
contains a record that has Jeff in the fname field, a keyword
$\alpha$ = 'fname:Jeff' is inserted to $B_v$. The length $\ell_v$ of $B_v$
is determined by the upper bound of the number of possible
keywords, derived from DB schema, so that two nodes of the
same level in the search tree have equal-length Bloom filters.
The insertion of keywords is performed by shrinking the output
of the hash functions ($\zeta_{\ell_v}(\mathcal{H}(\alpha))$) to fit in the corresponding BF
length $\ell_v$. Here, $\mathcal{H}$ is the set of hash functions associated with
the root node BF. See Figure 2.

Search using a BF search tree. Consider a simple recursive
algorithm Search below. Let $\alpha$ and $\beta$ be keywords and $r$ the
root of the search tree. Note that Search($\alpha \wedge \beta$, $r$) will output
all the leaves (i.e., record locations) containing both keywords
$\alpha$ and $\beta$; any ancestor of a leaf has all the keywords that the
leaf has, and therefore there should be a search path from
the root to each leaf containing $\alpha$ and $\beta$. This algorithm can
be easily extended to searching for any monotone Boolean
formula of keywords.

Search($\alpha \wedge \beta$, $v$):

    If the BF $B_v$ contains $\alpha$ and $\beta$, then
        If $v$ is a leaf, then output $\{v\}$.
        Otherwise, output $\bigcup_{c \,:\, \text{child of } v}$ Search($\alpha \wedge \beta$, $c$).
    Otherwise, output $\emptyset$.
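The recursion can be tried out in plaintext with the following Python sketch. To keep it short, each node stores an exact Python set of keywords in place of a Bloom filter (so there are no false positives), nothing is encrypted, and the tree is built bottom-up with a configurable branching factor; the names `Node`, `build` and `search` are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    keywords: Set[str]                      # stands in for the node's Bloom filter
    children: List["Node"] = field(default_factory=list)
    record_id: Optional[int] = None         # set only on leaves


def build(records: List[Set[str]], b: int = 2) -> Node:
    """Build a b-ary tree whose inner nodes hold the union of their children's keywords."""
    level = [Node(set(kw), record_id=i) for i, kw in enumerate(records)]
    while len(level) > 1:
        groups = [level[i:i + b] for i in range(0, len(level), b)]
        level = [Node(set().union(*(c.keywords for c in g)), children=g) for g in groups]
    return level[0]


def search(node: Node, required: Set[str]) -> List[int]:
    """Return the ids of all leaves containing every keyword in `required`."""
    if not required <= node.keywords:
        return []                           # prune: no descendant can match
    if not node.children:
        return [node.record_id]
    return [r for child in node.children for r in search(child, required)]
```

For example, over four records where only record 0 contains both 'fname:Jeff' and 'zip:34301', `search(root, {"fname:Jeff", "zip:34301"})` visits the root, descends only into subtrees whose keyword set covers both terms, and returns `[0]`.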


B.  Preprocessing 

Roughly  speaking,  in  this  phase,  S  gives  an  encrypted  DB  to 
IS.  To  be  more  specific,  by  executing  the  following  protocols, 
the  two  parties  encrypt  and  permute  the  records,  create  a  search 
tree  for  the  permuted  records,  and  prepare  record  decryption 
keys. 

Encrypting  database  index/records.  In  this  step,  the  server 
first  permutes  its  DB  to  hide  information  of  the  order  of 
records  in  the  DB  and  then  creates  BF-search  tree  on  this 
permuted  DB;  these  DB  and  search  tree  are  encrypted  and 
sent  to  the  index  server. 

1) (Shuffle and encrypt the records.) The server generates
a key pair $(pk, sk)$ for a public-key semi-homomorphic
(e.g., additively homomorphic) encryption
scheme (Gen, Enc, Dec). Given a database of $n$
records, the server S randomly shuffles the records. Let
$(R_1, \ldots, R_n)$ be the shuffled records. For each record $R_i$, S then
chooses a random string $s_i$, and computes $\tilde{s}_i \leftarrow \mathrm{Enc}_{pk}(s_i)$ and
$\tilde{R}_i = G(s_i) \oplus R_i$, where $G$ is a PRG.

2) (Encrypt the BF search tree.) S constructs a BF search
tree $T$ for the permuted records $(R_1, \ldots, R_n)$. It then
chooses a key $k$ at random for a PRF $F$. The Bloom
filter in each node $v$ is encrypted as follows:

$\tilde{B}_v = B_v \oplus F_k(v)$. (This encryption can be efficiently
decrypted inside SFE evaluation by GC.)

3) (Share.) Finally, S sends the (permuted) encrypted
records $(pk, (\tilde{s}_1, \tilde{R}_1), \ldots, (\tilde{s}_n, \tilde{R}_n))$ and the encrypted
search tree $\{\tilde{B}_v : v \in T\}$ to the index server. The client
will receive the PRF key $k$, and the hash functions $\mathcal{H} = \{h_i\}_{i=1}^{\eta}$
used in the Bloom filter generation.

Preparing  record  decryption  keys.  To  save  the  decryption 
time  in  the  on-line  phase,  the  index  server  and  the  server 
precompute  record  decryption  keys  as  follows: 

(Blind the decryption keys.) The index server IS chooses
a random permutation $\psi : [n] \to [n]$. For each $i \in [n]$, it
chooses $r_i$ randomly and computes $\tilde{s}'_{\psi(i)} \leftarrow \tilde{s}_i \cdot \mathrm{Enc}_{pk}(r_i)$.

Then, it sends $(\tilde{s}'_1, \ldots, \tilde{s}'_n)$ to S. Then, the server decrypts
each $\tilde{s}'_i$ to obtain the blinded key $s'_i$. Note that it
holds $s'_{\psi(i)} = s_i r_i$.
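The blinding step relies on the homomorphic property of the encryption scheme: multiplying two ciphertexts yields an encryption of the product of the plaintexts. The sketch below illustrates this with textbook ElGamal over a toy group. The modulus (the Mersenne prime 2^127 - 1), the base 3, and all function names are assumptions chosen for a short runnable example; they are not the parameters of the system and offer no real security.

```python
import secrets
from typing import List, Tuple

P = 2**127 - 1   # toy prime modulus (assumption, illustration only)
G = 3            # toy base (assumption)

Cipher = Tuple[int, int]


def keygen() -> Tuple[int, int]:
    sk = secrets.randbelow(P - 2) + 1
    return pow(G, sk, P), sk                      # (pk, sk)


def enc(pk: int, m: int) -> Cipher:
    y = secrets.randbelow(P - 2) + 1
    return pow(G, y, P), m * pow(pk, y, P) % P


def dec(sk: int, c: Cipher) -> int:
    c1, c2 = c
    return c2 * pow(c1, P - 1 - sk, P) % P        # c1^(-sk) via Fermat's little theorem


def mul(c: Cipher, d: Cipher) -> Cipher:
    """Component-wise product: an encryption of the product of the two plaintexts."""
    return c[0] * d[0] % P, c[1] * d[1] % P


def blind_keys(pk: int, enc_keys: List[Cipher]):
    """Index server side: permute the encrypted keys and blind each with a random r_i."""
    n = len(enc_keys)
    psi = list(range(n))
    secrets.SystemRandom().shuffle(psi)           # random permutation (0-indexed)
    r = [secrets.randbelow(P - 2) + 1 for _ in range(n)]
    blinded: List[Cipher] = [(0, 0)] * n
    for i in range(n):
        blinded[psi[i]] = mul(enc_keys[i], enc(pk, r[i]))
    return psi, r, blinded
```

When the server decrypts the ciphertext at position `psi[i]`, it obtains the product of the original key and `r[i]` modulo `P`; it cannot link the result to record `i` without knowing the permutation and the blinding factor.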

C.  Search 

Our  system  supports  any  SQL  query  that  can  be  represented  as 
a  monotone  Boolean  formula  where  each  variable  corresponds 
to  one  of  the  following  search  conditions:  keyword  match, 
range,  and  negation.  So,  without  loss  of  generality,  we  support 
non-monotone  formulas  as  well,  modulo  possible  performance 
overhead (see how we support negations below). See Figure 3
as  an  example. 

Traversing  the  search  tree  privately.  The  search  procedure 
starts  with  the  client  transforming  the  query  into  the  corre¬ 
sponding  Boolean  circuit.  Then,  starting  from  the  root  of  the 
search  tree,  the  client  and  the  index  server  will  compute  this 
circuit  Q  via  secure  computation.  If  the  circuit  Q  outputs  true, 
the parties visit all the children of the node, and again evaluate
this circuit Q on those nodes recursively, until they reach leaf
nodes; otherwise, the traversal at the node terminates. Note
that evaluation of Q outputs a single bit denoting the search
result at that node. It is fully secure, and reveals no information
about individual keywords.

Query: SELECT * FROM main WHERE
(fname = JEFF OR fname = JOHN) AND zip = 34301 AND income < 200

Circuit variables:
T1: fname = JEFF
T2: fname = JOHN
T3: zip = 34301
T4: income < 200

Figure 3. High level circuit representation of a query.

In  order  to  use  secure  computation,  we  need  to  specify  the 
query  circuit  and  the  inputs  of  the  two  parties  to  it.  However, 
since  the  main  technicalities  lie  in  constructing  circuits  for  the 
variables  corresponding  to  search  conditions,  we  will  describe 
how  to  construct  those  sub-circuits  only;  the  circuit  for  the 
Boolean  formula  on  top  of  the  variables  is  constructed  in  a 
standard  manner. 

Keyword match condition. We first consider a case where
a variable corresponds to a keyword match condition. For
example, in Figure 3 the variable $T_1$ indicates whether the
Bloom filter $B_v$ in a given node $v$ contains the keyword $\alpha$ =
'fname:JEFF'. Consider the Bloom filter hash values for the
keyword $\alpha$, and let $Z$ denote the positions to be checked, i.e.,
$Z := \zeta_{\ell_v}(\mathcal{H}(\alpha))$. If the Bloom filter $B_v$ contains the keyword
$\alpha$, the projected bits w.r.t. $Z$ should be all set, that is, we need
to check

$B_v{\downarrow}_Z \stackrel{?}{=} 1^\eta$. (1)

Recall that the index server has an encrypted Bloom filter
$\tilde{B}_v = B_v \oplus F_k(v)$, and the client has the PRF key $k$ and the hash
functions $\mathcal{H} = \{h_i\}_{i=1}^{\eta}$. Therefore, the circuit to be computed
should first decrypt and then check the equation (1). That is,
the keyword match circuit looks as follows:

$KM((b_1, \ldots, b_\eta), (r_1, \ldots, r_\eta)) = \bigwedge_{i=1}^{\eta} (b_i \oplus r_i)$.

Here, $(b_1, \ldots, b_\eta)$ is from the encrypted BF and $(r_1, \ldots, r_\eta)$
from the pseudorandom mask. That is, to this circuit KM, the
index server will feed $\tilde{B}_v{\downarrow}_Z$ as the first part $(b_1, \ldots, b_\eta)$ of
the input, and the client will feed $F_k(v){\downarrow}_Z$ as the second
$(r_1, \ldots, r_\eta)$. In order that the two parties may execute secure
computation, it is necessary that the client compute $Z$ and
send it (in plaintext) to the index server.
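The logic of the keyword match circuit can be checked in the clear with a short Python sketch. Here the circuit is evaluated as an ordinary function rather than as a garbled circuit, the PRF is instantiated with HMAC-SHA256 producing one mask bit per (node, position) pair, and positions are 0-indexed; these are all assumptions made for illustration, and the names `prf_bit`, `mask_filter` and `keyword_match` are hypothetical.

```python
import hashlib
import hmac
from typing import List, Sequence


def prf_bit(key: bytes, node_id: int, position: int) -> int:
    """One pseudorandom mask bit for a given node and filter position (assumed PRF)."""
    digest = hmac.new(key, f"{node_id}:{position}".encode(), hashlib.sha256).digest()
    return digest[0] & 1


def mask_filter(bf_bits: Sequence[int], key: bytes, node_id: int) -> List[int]:
    """XOR the node's Bloom filter with its pseudorandom mask (encrypts and decrypts)."""
    return [bit ^ prf_bit(key, node_id, p) for p, bit in enumerate(bf_bits)]


def keyword_match(b: Sequence[int], r: Sequence[int]) -> int:
    """KM circuit: AND over all positions of (b_i XOR r_i)."""
    return int(all(b_i ^ r_i for b_i, r_i in zip(b, r)))


def node_matches(enc_bits: Sequence[int], key: bytes, node_id: int,
                 Z: Sequence[int]) -> int:
    """Index server supplies the encrypted bits at Z, client supplies the mask bits at Z."""
    server_input = [enc_bits[p] for p in Z]
    client_input = [prf_bit(key, node_id, p) for p in Z]
    return keyword_match(server_input, client_input)
```

If the plaintext filter is `[0, 1, 1, 0, 1, 1, 0, 1]`, then checking positions `(1, 2, 4)` on its masked version yields 1, while checking `(1, 3, 4)` yields 0, regardless of the key. In the actual protocol neither party would see the other's input; only the output bit is revealed.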

Range/negation condition. Consider the variable $T_4$ in Figure 3 for example. Using the technique from [47], we augment the BF to support inserting a number $x \in \mathbb{Z}_{2^n}$, say with $n = 32$, and checking if the BF contains a number in a given range.


To insert an integer $a$ in a BF, all the canonical ranges containing $a$ are added in the filter. A canonical range with level $i$ is $[x 2^i, (x+1) 2^i)$ for some integer $x$, so for each level, there is only one canonical range containing the number $a$. In particular, for each $i \in \mathbb{Z}_n$, compute $x_i$ such that $a \in [x_i 2^i, (x_i+1) 2^i)$ and insert 'r:income:$i$:$x_i$' to the Bloom filter.

Given a range query $[a, b)$, we check whether a canonical range inside the given query belongs to the BF. In particular, for each $i \in \mathbb{Z}_n$, find, if any, the minimum $y_i$ such that $[y_i 2^i, (y_i+1) 2^i) \subseteq [a, b)$ and the maximum $z_i$ such that $[z_i 2^i, (z_i+1) 2^i) \subseteq [a, b)$; then check if the BF contains a keyword 'r:income:$i$:$y_i$' or 'r:income:$i$:$z_i$'. If any of the checks succeeds for some $i$, then output yes; otherwise output no. Therefore, a circuit for a range query is essentially ORs of keyword match circuits.

For example, consider a range query with $\mathbb{Z}_{2^4}$. When inserting a number 9, the following canonical ranges are inserted: [9, 10), [8, 10), [8, 12), [8, 16). Given a range query [7, 11), the following canonical ranges are checked: [7, 8), [10, 11), [8, 10). We have a match [8, 10).
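
The canonical-range bookkeeping can be sketched as follows. This is an illustrative sketch in which a Python set stands in for the Bloom filter (so there are no false positives); the function names and the `(level, x)` encoding are assumptions introduced here, and `keyword` only mirrors the 'r:income:i:x' naming used in the text.

```python
N_BITS = 4  # values live in Z_{2^n} with n = 4, as in the example above

def ranges_for_insert(a, n=N_BITS):
    """One canonical range per level i: the pair (i, x) with a in [x*2^i, (x+1)*2^i)."""
    return [(i, a >> i) for i in range(n)]

def ranges_for_query(a, b, n=N_BITS):
    """Per level, the leftmost and rightmost canonical ranges lying inside [a, b)."""
    checks = []
    for i in range(n):
        size = 1 << i
        y = -(-a // size)      # smallest x with x*size >= a (ceiling division)
        z = b // size - 1      # largest x with (x+1)*size <= b
        if y > z:
            continue           # no canonical range of this level fits inside [a, b)
        checks.append((i, y))
        if z != y:
            checks.append((i, z))
    return checks

def keyword(field, level, x):
    """Keyword string that would be inserted into (or looked up in) the filter."""
    return f"r:{field}:{level}:{x}"

def as_interval(level, x):
    """Convert (level, x) back to the half-open interval it denotes."""
    return (x << level, (x + 1) << level)

inserted = {keyword("income", i, x) for i, x in ranges_for_insert(9)}
checked = ranges_for_query(7, 11)
hit = any(keyword("income", i, x) in inserted for i, x in checked)
```

Here `checked` corresponds to the intervals [7, 8), [10, 11), and [8, 10), and the lookup succeeds through [8, 10), matching the worked example.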

Negation conditions can be easily changed to range conditions; for example, a condition 'not work hrs = 40' is equivalent to 'work hrs ≤ 39 OR work hrs ≥ 41'.

Overall  procedure  in  a  node.  In  summary,  we  describe  the 
protocol  that  the  client  and  the  index  server  execute  in  a  node 
of  the  search  tree. 

1)  The  client  constructs  a  query  circuit  corresponding  to 
the  given  SQL  query.  Then,  it  garbles  the  circuit  and 
sends  the  garbled  circuit,  Yao  keys  for  its  input,  and  the 
necessary  BF  indices. 

2)  The  client  and  the  index  server  execute  OT  so  that  IS 
obtains  Yao  keys  for  its  input  (i.e.,  encrypted  BF).  Then, 
the  index  server  evaluates  the  garbled  circuit  and  sends 
the  resulting  output  Yao  key  to  the  client. 

3)  The  client  decides  whether  to  proceed  based  on  the 
result. 

Record  Retrieval.  When  the  client  and  the  index  server 
reach  some  of  the  leaf  nodes  in  the  tree,  the  client  retrieves 
the  associated  records.  In  particular,  if  computing  the  query 
circuit on the $i$th leaf outputs success, the index server sends
Ri)  to  the  client.  Then,  the  client  sends  t/>(i)  to  S, 
and  gets  back  Note  that  it  holds  :=  SiVi.  The  client 
C  decrypts  Ri  using  Si  and  obtains  the  output  record. 

V.  Advanced  Features 

In this section, we discuss how our system supports advanced features such as query policies and one-case indistinguishability. We also overview insert/delete/update operations from the server.

A.  Policy  Enforcement 

The  policy  enforcement  is  performed  through  a  three-party 
protocol  among  the  query  checker  QC  (holding  the  policy), 
the client C (holding the query), and the index server IS. A policy is represented as a circuit that takes a query as input
and  outputs  accept  or  reject.  In  our  system,  QC  garbles  this 
policy  circuit,  and  IS  evaluates  the  garbled  policy  circuit  on 
the  client’s  query.  A  key  idea  here  is  to  have  the  client  and 
the  query  checker  share  the  information  of  input/output  wire 
key  pairs  in  this  garbled  policy  circuit;  then,  the  client  can 
later  construct  a  garbled  query  circuit  (used  in  the  search  tree 
traversal)  to  be  dependent  on  the  output  of  the  policy  circuit. 
Assuming  semi-honest  security,  this  sharing  of  information  can 
be  easily  achieved  by  having  the  client  choose  those  key  pairs 
(instead  of  QC)  and  send  them  to  QC.  The  detailed  procedure 
follows. 

Before  the  tree  search  procedure  described  in  the  previous 
section  begins,  the  client  C,  the  query  checker  QC,  and  the 
index  server  IS  execute  the  following  protocol. 

1) Let $q = (q_1, \ldots, q_m) \in \{0, 1\}^m$ be a string that encodes a query (we will discuss our encoding method in Appendix A). The client generates Yao key pairs $W_q = ((w_1^0, w_1^1), \ldots, (w_m^0, w_m^1))$ for the input wires of the policy circuit, and a key pair $W_x$ for the output wire. The client sends the key pairs $W_q$ and $W_x$ to query checker QC. It also sends the index server the garbled input $\tilde{q} = (w_1^{q_1}, \ldots, w_m^{q_m})$.

2) Let $P$ be the policy circuit. QC generates a garbled circuit $\tilde{P}$ using $W_q$ as input key pairs, and $W_x$ as the output key pair (QC chooses the other key pairs of $\tilde{P}$ at random). Then, QC sends $\tilde{P}$ to the index server.

3) The index server evaluates the circuit $\tilde{P}$ on $\tilde{q}$ obtaining the output wire key $\tilde{x} = \tilde{P}(\tilde{q})$. Note that $\tilde{x} \in W_x$.
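
The key-sharing idea in this protocol can be illustrated with a toy one-gate garbling. The sketch below is a textbook-style construction written for illustration only: it is not secure as written, it is not the system's garbling scheme, and the policy $P(q_1, q_2) = q_1 \wedge q_2$, the SHA-256 pad, and all function names are assumptions. What it shows is that the client picks the input and output key pairs, QC garbles with them, and IS ends up holding one of the client's output keys without learning which bit it encodes.

```python
import hashlib
import secrets

KEY_LEN = 16
TAG = b"\x00" * 16   # redundancy that lets the evaluator recognise the right row

def new_key_pair():
    """A wire key pair: (key for bit 0, key for bit 1)."""
    return (secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN))

def _pad(ka, kb):
    return hashlib.sha256(ka + kb).digest()   # 32 bytes = KEY_LEN + len(TAG)

def _xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def garble_gate(in_a, in_b, out, gate):
    """QC side: encrypt the output key of gate(a, b) under each pair of input keys."""
    rows = [_xor(_pad(in_a[a], in_b[b]), out[gate(a, b)] + TAG)
            for a in (0, 1) for b in (0, 1)]
    secrets.SystemRandom().shuffle(rows)      # hide which row belongs to which inputs
    return rows

def evaluate_gate(rows, ka, kb):
    """IS side: only the row matching the held input keys decrypts to a valid tag."""
    pad = _pad(ka, kb)
    for row in rows:
        plain = _xor(row, pad)
        if plain[KEY_LEN:] == TAG:
            return plain[:KEY_LEN]
    raise ValueError("no row decrypted")

def run_policy(q):
    """Walk through steps 1-3 for a two-bit query and the toy policy q1 AND q2."""
    W_q = [new_key_pair() for _ in q]                 # step 1: client picks key pairs
    W_x = new_key_pair()
    garbled_q = [W_q[i][bit] for i, bit in enumerate(q)]
    garbled_P = garble_gate(W_q[0], W_q[1], W_x,      # step 2: QC garbles P
                            lambda a, b: a & b)
    x_key = evaluate_gate(garbled_P, *garbled_q)      # step 3: IS evaluates
    return W_x.index(x_key)   # only the client (who knows W_x) can decode this

accepted = run_policy((1, 1))
rejected = run_policy((1, 0))
```

Because the client knows $W_x$, it can later garble a query circuit that takes the policy output as one of its input wires.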

After  the  execution  of  this  protocol,  the  original  search  tree 
procedure  starts  as  described  before.  However,  the  procedure 
is  slightly  changed  when  evaluating  a  leaf  node  as  follows: 

1) Let $Q'(b, r, x) = Q(b, r) \wedge x$ be an augmented circuit, where $Q$ is the query circuit, $b$ and $r$ are the inputs from IS and C respectively, and $x$ is a bit representing the output from the policy circuit. The client C generates a garbled query circuit $\tilde{Q}'$ using wire key pair $W_x$ for the bit $x$. Then, it sends $(\tilde{Q}', \tilde{r})$ to the index server, where $\tilde{r}$ is the garbled input of $r$.

2) After obtaining the input keys $\tilde{b}$ for $b$ from OT with C, the index server IS evaluates $\tilde{Q}'(\tilde{b}, \tilde{r}, \tilde{x})$ and sends the resulting output key to the client. Recall that it has already evaluated the garbled policy circuit $\tilde{P}(\tilde{q})$ and obtained $\tilde{x}$.

3)  The  client  checks  the  received  key  and  decides  to  accept 
or  reject. 

Regarding  privacy,  the  client  learns  nothing  about  the  policy, 
since  it  never  sees  the  garbled  policy  circuit.  The  index  server 
obtains  the  topology  of  the  policy  circuit  (from  the  garbled 
policy  circuit). 

Note  that  the  garbled  policy  circuit  is  evaluated  only  once, 
before  the  search  tree  execution  starts.  Therefore,  the  policy 
checking  mechanism  introduces  only  a  small  overhead.  It  is 
also  worth  observing  that,  so  far,  we  have  not  assumed  any 
restriction on the policy to be evaluated. Since Yao-based computation can compute any function represented as a circuit,
in  principle,  we  could  enforce  any  policy  computable  in  a 
reasonable  time  (as  long  as  it  depends  only  on  the  query).  We 
describe  our  own  implemented  policy  circuit  in  more  detail  in 
Appendix  A. 

B.  One-case  Indistinguishability 

So  far,  in  our  system  the  index  server  learns  how  many 
records  the  client  retrieved  from  the  query.  In  many  use  cases, 
this  leakage  should  be  insignificant  to  the  index  server,  in 
particular,  when  the  number  of  returned  results  is  expected  to 
be,  say,  more  than  a  hundred.  However,  there  do  exist  some  use 
cases  in  which  this  leakage  is  critical.  For  example,  suppose 
that  a  government  agent  queries  the  passenger  database  of 
an  airline  company  looking  for  persons  of  interest  (POI).  We 
assume  that  the  probability  that  there  is  indeed  a  POI  is  small, 
and  the  airline  or  the  index  server  discovering  that  a  query 
resulted in a match may cause panic. Motivated by the above scenario, we consider a security notion which we call one-case indistinguishability.

Motivation. Consider a triple $(q, D_0, r)$ where $q$ is a query, $D_0$ is a database on which $q$ returns no record, and $r$ is a record that satisfies $q$. Let $D_1$ be a database that is the same as $D_0$ except that a record is randomly chosen and replaced with $r$. Let $\mathrm{VIEW}_0$ (resp. $\mathrm{VIEW}_1$) denote the view of IS when the client runs a query $q$ with the database $D_0$ (resp., $D_1$).

A natural start would be to require that for any such $(q, D_0, r)$, the difference between the two distributions $\mathrm{VIEW}_0$ and $\mathrm{VIEW}_1$ be at most a small $\epsilon$ (in the computational sense), which we call $\epsilon$ zero-one indistinguishability. However, it does not seem possible to achieve negligible difference $\epsilon$ without suffering significant performance degradation (in fact, our system satisfies this notion for a tunable small constant $\epsilon$). Unfortunately, this definition does not provide a good security guarantee when the difference $\epsilon$ is non-negligible, in particular, for the scenario of finding POIs. For example, let $\Pi$ be a database system with perfect privacy and $\Pi'$ be the same as $\Pi$ except that when it is 1-case (i.e., a query with one result record), the client sends the index server the message “the 1-case occurred” with non-negligible probability. It is easy to see that $\Pi'$ satisfies the definition with some non-negligible $\epsilon$, but it is clearly a bad and dangerous system.

One-case indistinguishability. Observe that in the use case of finding POIs, we don’t particularly worry about “the 0-case”, that is, it is acceptable if the airline company sometimes knows that a query definitely resulted in no returned record. Motivated by this observation, this definition intuitively requires that if the a-priori probability of a 1-case is $\delta$, then the a-posteriori probability of a 1-case is at most $(1+\epsilon)\delta$. For example, for $\epsilon = 1$, the probability could grow from $\delta$ to $2\delta$, but never more than that, no matter what random choices were made. Moreover, if the a-priori probability was tiny, the a-posteriori probability remains tiny even if unlucky random choices were made. In particular, consider $(q, D_0, r)$ and $D_1$ as before. Now consider a distribution $E$ that outputs $(b, v)$ where $b \in \{0, 1\}$ is chosen with $\Pr[b = 1] = \delta$, and $v$ is the view of the index server when the query $q$ is run on $D_b$. The system satisfies $\epsilon$ one-case indistinguishability if for any $(q, D_0, r)$, $\delta$ and $v$, it holds

$$\Pr[b = 1 \mid v] \le (1 + \epsilon)\delta.$$

Augmenting the design. To achieve these indistinguishability notions, we change the design such that the client chooses a small random number of paths leading to randomly selected leaves. In particular, let $\mathcal{D}$ be the probability distribution on the number of random paths defined as follows:

Distribution $\mathcal{D}$: For $1 \le x \le \alpha - 1$, $\Pr_{\mathcal{D}}[x] = 1/\alpha$.

For $x \ge \alpha$, $\Pr_{\mathcal{D}}[x] = (1/\alpha) \cdot 1/2^{x-\alpha+1}$.

Here, $\alpha$ is a tunable parameter. The client chooses $x \leftarrow \mathcal{D}$, and then it also chooses $x$ random indices $(j_1, \ldots, j_x) \leftarrow [n]^x$. When handling the query, the client superimposes the basic search procedure above with these random paths. Our system is $1/\alpha$ zero-one indistinguishable and $\epsilon$ one-case indistinguishable with $\epsilon = 1$. Intuitively, the leakage to the index server is the tree traversal pattern, and these additional random paths make the 0-case look like 1-case with a reasonably good probability. We give more detail in Appendix B.
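
The stated parameters can be checked numerically under a simplified model. The sketch below assumes that the index server's view is only the number $k$ of traversed paths, that a 0-case shows the $x$ fake paths, and that a 1-case shows $x + 1$ paths; this simplification, the function names, and the sampling routine are assumptions for illustration and not the analysis of Appendix B.

```python
import random
from fractions import Fraction

def pmf_D(x, alpha):
    """Pr_D[x] as defined in the text, for integers x >= 1 (zero elsewhere)."""
    if x < 1:
        return Fraction(0)
    if x <= alpha - 1:
        return Fraction(1, alpha)
    return Fraction(1, alpha) * Fraction(1, 2 ** (x - alpha + 1))

def sample_D(alpha, rng):
    """Draw the number of fake paths: uniform head on 1..alpha-1, geometric tail from alpha."""
    u = rng.randrange(alpha)
    if u < alpha - 1:
        return u + 1
    x = alpha
    while rng.random() < 0.5:
        x += 1
    return x

def posterior_one_case(k, alpha, delta):
    """Pr[b = 1 | k paths seen] in the simplified model, for k >= 1."""
    p0 = pmf_D(k, alpha)        # 0-case: all k paths are fake
    p1 = pmf_D(k - 1, alpha)    # 1-case: one real path plus k - 1 fake ones
    return delta * p1 / (delta * p1 + (1 - delta) * p0)

def statistical_distance(alpha, max_k=400):
    """Half the L1 distance between the 0-case and 1-case path counts (truncated sum)."""
    return sum(abs(pmf_D(k, alpha) - pmf_D(k - 1, alpha)) for k in range(1, max_k)) / 2

alpha, delta = 10, Fraction(1, 1000)
worst = max(posterior_one_case(k, alpha, delta) for k in range(1, 200))
distance = statistical_distance(alpha)
draw = sample_D(alpha, random.Random(0))
```

In this model the likelihood ratio between the two cases never exceeds 2, which is where $\epsilon = 1$ comes from, and the truncated statistical distance is within a negligible amount of $1/\alpha$.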

If we slightly relax the definition and ignore views taking place with a tiny probability, we can even achieve both 1-case and 0-case indistinguishability at the same time; the probability of the number $x$ of fake paths is now $1/2^{|x-\alpha|+1}$ with a parametrized center $\alpha$, say $\alpha = 20$ (except when $x = 0$, i.e., $\Pr[x = 0] = 1/2^{\alpha+1}$).

Against  the  server.  One-case  indistinguishability  against  the 
server  is  easily  achieved  by  generating  a  sufficient  number  of 
dummy  record  decryption  keys  in  the  preprocessing  phase;  the 
index  server  will  let  the  client  know  the  (permuted)  positions 
of  the  dummy  keys.  When  zero  records  are  returned  from  a 
query,  the  client  asks  for  a  dummy  decryption  key  from  the 
server.  For  brevity,  we  omit  the  details  here,  and  exclude  this 
feature  in  the  security  analysis. 

C.  Delete,  Insert,  and  Update  from  the  Server 
Our  system  supports  a  basic  form  of  dynamic  deletion, 
insertion,  and  update  of  a  record  which  is  only  available 
to  the  server.  If  it  would  like  to  delete  a  record  Ri,  then 
the  server  sends  i  to  the  index  server,  which  will  mark 
the  encrypted  correspondent  as  deleted.  For  newly  inserted 
(encrypted)  records,  the  index  server  keeps  a  separate  list  for 
them  with  no  permutation  involved.  In  addition,  it  also  keeps 
a  temporary  list  of  their  Bloom  filters.  During  search,  the 
temporary  list  is  also  scanned  linearly,  after  the  tree.  When 
the  length  of  the  temporary  Bloom  filter  list  reaches  a  certain 
threshold,  all  the  current  data  is  re-indexed  and  a  new  Bloom 
filter  tree  is  constructed.  The  frequency  of  rebuilding  the  tree 
is  of  course  related  to  the  frequency  of  the  modifications  and 
also the threshold we choose for the temporary list’s size. Our tree building takes one hour per 100M records. Finally, update is simply handled by atomically issuing a delete and an insert command.


Functionality $\mathcal{F}_{db}$

Parameter: Leakage profile.

Init: Given input $(D, P)$ from S, do the following:

1) Store the database records $D$ and the policy $P$. Let $n$ be the number of records in $D$. Shuffle $D$ and let $(R_1, \ldots, R_n)$ be the shuffled records. Choose a random permutation $\pi : [n] \to [n]$. Construct a BF-search tree for $(R_1, \ldots, R_n)$ using the hash functions $\mathcal{H}$.

2) To handle the client’s queries, it chooses hash functions $\mathcal{H} = \{h_i : \{0, 1\}^* \to [\ell]\}_{i=1}^{\eta}$ for Bloom filters with parameters $(\eta, \ell)$ to maintain false positive rate of $10^{-6}$.

3) Finally, return a $\mathrm{DONE}_{init}$ message and the leakage to all parties.

Query: Given input $q$ from C, do the following:

1) Check if $q$ is allowed by $P$. If the check fails, then disallow the query by setting $y$ to indicate rejection. Otherwise, for each $i \in [n]$, let $B_i \in \{0, 1\}^{\ell}$ be the Bloom filter associated with the $i$th leaf in the BF tree. For $i = 1, \ldots, n$, check if the query passes according to the filter $B_i$ (refer to Section II); if so, add $(i, R_i)$ to the result set $y$.

2) Return $y$ to C and return a $\mathrm{DONE}_{query}$ message and leakage to all parties.


Figure 4. The Ideal Functionality $\mathcal{F}_{db}$


We note that updates are not our core contribution; we implement and report them here, but don’t focus on their design and performance. A more scalable update system would use a BF tree rather than a list; its implementation is a simple modification to our system.

VI.  Security  Analysis 

In  this  section,  we  present  an  overview  of  the  security  of  our 
system.  A  full  analysis  with  formal  definitions  and  extensive 
proofs  is  completed  and  written  separately. 

We consider static security against a semi-honest adversary that controls at most one participant. We first describe an ideal functionality $\mathcal{F}_{db}$ parameterized with a leakage profile in Figure 4, and then show that our system securely realizes the functionality where the leakage is essentially the search tree traversal pattern and the pattern of accessed BF indices.

For  the  sake  of  simplicity,  we  only  consider  security  where 
there  are  no  insert/delete/update  operations, ^and  unify  the 
server  and  the  query  checker  into  one  entity.  We  also  assume 
that  all  the  records  have  the  same  length. 

We  use  the  DDH  assumption  (for  ElGamal  encryption  and 
Naor-Pinkas  OT),  and  our  protocols  are  in  the  random  oracle 
model  (for  Naor-Pinkas  OT  and  OT  extension).  We  also  use 
PRGs  and  PRFs,  and  those  primitives  are  implemented  with 
AES. 

^ As access patterns are revealed, additional information for inserted/deleted/updated records is leaked. For example, C or IS may learn whether a returned record was recently inserted; they also may get advantage in estimating whether the query matched a recently deleted record. We stress that this additional leakage can be removed by re-running the setup of the search structure.


A.  Security  of  Our  System 

With empty leakage profile, the ideal functionality $\mathcal{F}_{db}$ in Figure 4 captures the privacy requirement of a database management system in which each query is handled deterministically. The client obtains only the query results, but nothing more. The index server and the server learn nothing. Realizing such a functionality incurs a performance hit. Our system realizes the functionality $\mathcal{F}_{db}$ with the leakage profile described below. The security of our system can be proved from the security of the secure computation component, and is deferred to the full version.

Leakage in Init. Since the server has all the input, the leakage to S is none. The leakage to C is $n$, that is, the total number of records. The leakage to IS is $n$ and $|R_i|$.

Leakage to S in each query. We first consider the leakage to the server. The server is involved only when the record is retrieved. Let $((i_1, R_{i_1}), \ldots, (i_j, R_{i_j}))$ be the query results. Then, the leakage to the server is $(\pi(i_1), \pi(i_2), \ldots, \pi(i_j))$.

Leakage to C in each query. The leakage to the client is the BF-search tree traversal paths, that is, all the nodes $v$ in which the query passes according to the filter $B_v$.

Leakage to IS in each query. The leakage to the index server is a little more than that to the client. In particular, the nodes in the faked paths that the client generates due to one-case indistinguishability are added to the tree search pattern. Also, the topology of the query circuit and of the policy circuit is leaked to IS as well. Finally, the BF indices are also revealed to IS (although not the BF content), but assuming that the hash functions are random, those indices reveal little information about the query. However, based on this, after observing multiple queries, IS can infer some correlations among the keywords of C’s queries.

B.  Discussion 

Leakage  to  the  server.  We  could  wholly  remove  the  leakage 
to  the  server  by  modifying  the  protocol  as  follows: 

Remove the decryption key preparation (and blinded keys) in the preprocessing; instead, the client receives the secret key sk from the server. The client (as the receiver) and the index server (as the sender) execute oblivious transfer at each leaf of the search tree. The choice bit of the client is whether the output of the query circuit is success. The two messages of the index server are the encrypted record and a string of zeros.

However,  we  believe  that  it  is  important  for  the  server  to  be 
able  to  upper-bound  the  number  of  retrieved  records.  Without 
such  control,  misconfiguration  on  the  query  checker  side  may 
allow  overly  general  queries  to  be  executed,  causing  too  many 
rows  to  be  returned  to  the  client;  in  contrast,  in  our  approach, 
S  releases  record  decryption  keys  at  the  end,  and  therefore 
it  is  easy  to  enforce  the  sanity  check  of  the  total  number  of 
returned records. Moreover, if S has a commercial DB, it may be convenient to implement a payment mechanism in association with key release by S.

OR queries. For OR queries passing the policy, our system leaks very little information. In particular, the leakage to the client is minimal, as the tree traversal pattern can be reconstructed from the returned records. As a consequence, if the client retrieves only document ids, the client learns nothing about the results for individual terms in his query. The leakage to the index server is similar. We believe that the topology of the SQL formula and the policy circuit reveals little information about the query and the policy. If desired, we can even hide this information using universal circuits [37] with a circuit size blow-up of a logarithmic multiplicative factor.

AND queries. For AND queries, the tree traversal pattern consists of two kinds of paths. The first are, of course, the paths reaching the leaves (query results). The second stop at some internal nodes due to our BF approach^; although the leakage from this pattern reveals more information about which nodes do not contain a given keyword, we still believe this leakage is acceptable in many use cases.

We  stress  that  the  second  leakage  is  related  to  the  fact  that 
a  large  linear  running  time  seems  to  be  inherent  for  some 
AND  queries,  irrespective  of  privacy,  but  depending  only  on 
the  underlying  database  (see  Section  VIII-C  for  more  detail). 
Therefore,  if  we  aim  at  running  most  AND  queries  in  sublinear 
time,  the  running  time  will  inherently  leak  information  on  the 
underlying  DB. 

VII.  Implementation 

We built a prototype of the proposed system to evaluate its practicality in terms of performance. The prototype was developed from scratch in C++ (more than a year of effort, almost two years including design) and consists of about 10KLOC. In this section, we describe several interesting parts of the implementation that are mostly related to the scalability of the system.

Crypto building blocks. We developed custom implementations for all the cryptographic building blocks that were previously described in Section II. More specifically, we used the GNU Multiple Precision (GMP) library to implement oblivious transfers, garbled circuits and the semi-homomorphic key management protocol. The choice of GMP was mostly based on thread-safety. As for AES-based PRF, we used the OpenSSL implementation because it takes advantage of the AES-NI hardware instructions, thus delivering better performance.

Parallelization. The current implementation of Blind Seer supports parallel preprocessing and per-query threading when searching. For all the multi-threading features we used Intel’s Threading Building Blocks (TBB) library. To enable multithreaded execution of the preprocessing phase we created a 3-stage pipeline. The first stage is single-threaded and it is responsible for reading the input data. The second stage handles record preprocessing. This stage is executed in parallel by a pool of threads. Finally, the last stage is again single-threaded and is responsible for handling the encrypted records. Concurrently supporting multiple queries was straightforward as all the data structures are read-only. To avoid accessing the Bloom filter tree while it is being updated by a modification command, we added a global writer lock (which does not block reads). Since we only currently support parallelization on a one-thread-per-query basis, it only benefits query throughput, not latency. However, long-running queries involve a large amount of interaction between querier and server that is independent and thus amenable to parallelization. The improvement we see in throughput is a good indicator for how much we could improve latency of slow queries by applying parallelization to these interactions.

^ For example, consider a query $q$ that looks for two keywords, say, $q = \alpha \wedge \beta$. Let $v$ be some node and $c_1, \ldots, c_b$ be the children of $v$ in the search tree. If $c_1$ contains only $\alpha$, and $c_2$ contains only $\beta$, then $v$ will contain both $\alpha$ and $\beta$, and so the node $v$ will pass the query; however, neither $c_1$ nor $c_2$ would.

Bloom filter tree. This is the main index structure of our system which grows by the number of records and the supported features (e.g., range). For this reason, the space efficiency of the Bloom filter tree is directly related to the scalability of the system. In the current version of our system we have implemented two space optimizations: one on the representation of the tree and another on the size of Bloom filter in each tree node.

Firstly, we avoided storing pointers for the tree representation, which would result in wasting almost 1 GB of memory for 100M records. This is achieved by using a flat array with fixed size allocations per record.

Secondly, we observed that naively calculating the number of items stored in the inner nodes by summing the items of their children is inefficient. For example, consider the case of storing the ‘Sex’ field in the database, which has only two possible values. Each Bloom filter in the bottom layer of the tree (leaves) will store either the value sex:male or sex:female. However, their parent nodes will keep space for 10 items, although the Sex field can have only two possible values. Thus, we estimate the number of items that need to be stored in a given level as the minimum between the cardinality of the field and the number of leaf-nodes of the current subtree. This optimization alone reduced the total space of the tree by more than 50% for the database we used in our evaluation.
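
The effect of the second optimization can be sketched for a single field. The code below is an illustration under stated assumptions: it uses the standard Bloom filter sizing formula $m = -n \ln p / (\ln 2)^2$, treats the tree as a complete tree with a fixed branching factor, and the function names, the 1,000-leaf tree, and the $10^{-6}$ false positive rate are example choices, not measurements from the evaluation.

```python
import math

def bloom_bits(n_items, fp_rate):
    """Standard Bloom filter sizing: m = -n ln(p) / (ln 2)^2 bits."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))

def items_per_node(level, branching, cardinality, optimized):
    """Items reserved for one field in a node `level` steps above the leaves."""
    leaves_in_subtree = branching ** level
    if optimized:
        return min(cardinality, leaves_in_subtree)   # capped by the field's cardinality
    return leaves_in_subtree                         # naive: sum over all children

def tree_bits(n_leaves, branching, cardinality, fp_rate, optimized):
    """Total bits reserved for this field across all levels of the tree."""
    total, level, nodes = 0, 0, n_leaves
    while True:
        items = items_per_node(level, branching, cardinality, optimized)
        total += nodes * bloom_bits(items, fp_rate)
        if nodes == 1:
            return total
        nodes = math.ceil(nodes / branching)
        level += 1

# A two-valued field (such as 'Sex') in a 10-ary tree over 1,000 records.
naive = tree_bits(1000, 10, 2, 1e-6, optimized=False)
capped = tree_bits(1000, 10, 2, 1e-6, optimized=True)
```

For a low-cardinality field the capped estimate stops growing after the first level, whereas the naive estimate reserves roughly the same amount of space at every level of the tree.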

Keyword search and stemming. Although we focus on supporting database search on structured data, our underlying system works with collections of keywords. Thus, it can trivially handle other forms of data, like keyword search over text documents, or even keyword search on text fields of a database. We actually do support the latter: in our system we provide this functionality using the special operator CONTAINED_IN(column, keyword). Also, we support stemming over keyword search by using the Porter stemming algorithm [2].

VIII.  Evaluation 

In this section, we evaluate our system. We first compare our system against MySQL as a baseline, to establish what the performance cost of providing private search is. We then generalize the performance expectations of our system by performing a theoretical analysis based on the type of queries.

Figure 5. Comparison with MySQL for single-term queries that have a single result (first four bar groups) and 2 to 10 results (last four bar groups). The search terms are either strings (str) or integers (int) and the returned result is either the id or the whole record (star).

Figure 6. Comparison of the scaling factor with respect to the result set size, using single-term queries. Both MySQL and Blind Seer scale linearly; however, Blind Seer’s constant factor is 15× worse (mostly due to increased network communication).

Dataset.  The  dataset  we  use  in  all  of  our  tests  for  the  first 
part  of  the  evaluation  is  a  generated  dataset  using  learned 
probability  distributions  from  the  US  census  data  and  text 
excerpts  from  “The  Call  of  the  Wild”,  by  Jack  London.  Each 
record  in  our  generated  database  contains  personal  information 
generated  with  similar  distributions  to  the  census.  It  also 
contains  a  globally  unique  ID,  four  fields  of  random  text 
excerpts ranging from 10 to 2000 bytes from “The Call of the
Wild”,  and  a  “fingerprint”  payload  of  random  data  ranging 
from  50000  to  90000  bytes.  The  payload  is  neither  searchable 
nor  compressible,  and  is  included  to  emulate  reasonable  data 
transfer  costs  for  real-world  database  applications.  The  census 
data  fields  are  used  to  enable  various  types  of  single-term 
queries  such  as  term  matching  and  range  queries,  and  the  text 
excerpts  for  keyword  search  queries. 

Testbed.  Our  tests  were  run  on  a  four-computer  testbed 
that  Lincoln  Labs  set  up  and  programmed  for  the  purpose 
of  testing  our  system  and  comparing  it  to  MySQL.  Each 
server was configured with two Intel Xeon 2.66 GHz X5650
processors,  96GB  RAM  (12x8  GB,  1066  MHz,  Dual  Ranked 
LV  RDIMMs),  and  an  embedded  Broadcom  1GB  Ethernet 
NICS  with  TOE.  Two  servers  were  equipped  with  a  50TB 
RAID5  array,  and  one  with  a  20TB  array.  These  were  used 
to  run  the  owner  and  index  server.  MySQL  was  configured 
to  build  separate  indices  for  each  field.  DB  queries  were  not 
known  in  advance  for  MySQL  or  for  our  system. 

A.  Querying  Performance 

Single term queries with a small result set. Figure 5 shows a comparison of single term queries against MySQL. We expect the run time for both our system and MySQL to depend primarily on the number of results returned. The first four pairs show average and standard deviation for query time on queries with exactly one result in the entire database, and the latter four for queries with a few (2-10) results. Queries are further grouped into those which are run on integer fields (int) and string fields (str), and those which return only record ids (id) and those which return full record content (star). For each group, we executed 200 different queries to avoid caching effects in MySQL.

As  we  can  see,  for  single  result  set  queries,  our  system 
is  very  consistent.  Unlike  with  MySQL,  the  type  of  query 
has  no  effect  on  performance,  since  all  types  are  stored 
and  queried  the  same  way  in  the  underlying  Bloom  filter 
representation.  Also,  the  average  time  is  dominated  by  the 
average  number  of  results,  which  is  slightly  larger  for  integer 
terms.  Unexpectedly,  there  is  also  no  performance  difference 
for  returning  record  ids  versus  full  records.  This  is  likely 
because  for  a  single  record,  the  performance  is  dominated 
by  other  factors  like  circuit  evaluation,  tree  traversal  and  key 
handling,  rather  than  record  transfer  time.  Overall,  aside  from 
some  bad-case  scenarios,  we  are  generally  less  than  2x  slower. 

Variation  in  performance  of  our  system  is  much  larger  when 
returning  a  few  results.  This  is  because  the  amount  of  tree 
traversal  that  occurs  depends  on  how  much  branching  must 
occur.  This  differs  from  single  result  set  queries,  where  each 
tree  traversal  is  a  single  path.  With  the  larger  result  sets,  we 
can  also  begin  to  see  increased  query  time  for  full  records  as 
opposed  to  record  ids,  although  it  remains  a  small  portion  of 
the  overall  run  time. 

Scaling with result set size. Figure 6 expands on both
systems’  performance  scaling  with  the  number  of  results 
returned.  This  experiment  is  also  run  with  single  term  queries, 
but  on  a  larger  range  of  return  result  set  sizes.  As  one  would 
expect,  the  growth  is  fairly  linear  for  both  systems,  although 
our constant factor is almost 15x worse. This indicates that for
queries  with  a  small  result  set,  the  run  time  is  dominated  by 
additive  constant  factors  like  connection  setup  for  which  we 
are  not  much  slower  than  MySQL.  However,  the  multiplicative 
constant  factors  involved  in  our  interactive  protocol  are  much 
larger,  and  grow  to  dominate  run  time  for  longer  running 
queries.  This  overhead  is  mostly  due  to  increased  network 
communication because of the interactiveness of the search
protocol. Although this is inherent, we believe that there is
room for implementation optimizations that could lower this
constant factor.

Figure 7. Boolean queries having a few results (< 10). The first three are
two-term AND queries where one of the terms has a single result and the
other varies from 1 to 10K results. The fourth group includes monotonic DNF
queries with 4-9 terms; the last includes 5-term DNF queries with negations.
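The scaling behaviour reported for Figure 6 can be summarized with a simple affine cost model. This is only an illustrative sketch: the symbols a (fixed per-query cost, such as connection setup) and b (cost per returned result) are our own notation rather than quantities defined in the evaluation, and only the ratio of roughly 15 between the per-result constants comes from the text.

```latex
\begin{align*}
T_{\mathrm{ours}}(r)  &= a_{\mathrm{ours}} + b_{\mathrm{ours}}\, r, &
T_{\mathrm{MySQL}}(r) &= a_{\mathrm{MySQL}} + b_{\mathrm{MySQL}}\, r, \\[4pt]
\frac{T_{\mathrm{ours}}(r)}{T_{\mathrm{MySQL}}(r)}
  &\longrightarrow \frac{a_{\mathrm{ours}}}{a_{\mathrm{MySQL}}}
  \quad (r \to 0), &
\frac{T_{\mathrm{ours}}(r)}{T_{\mathrm{MySQL}}(r)}
  &\longrightarrow \frac{b_{\mathrm{ours}}}{b_{\mathrm{MySQL}}} \approx 15
  \quad (r \to \infty).
\end{align*}
```

Under this model the slowdown for small result sets is governed by the additive terms, while for large result sets it approaches the ratio of the per-result constants.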

Boolean  queries.  Figure  7  shows  our  performance  on  various 
Boolean  queries.  The  first  three  groups  show  average  query 
time for 2-term AND queries. In each case, one term occurs
only  once  in  the  database,  resulting  in  the  overall  Boolean 
AND  having  only  one  match  in  the  database.  However,  the 
second  term  increases  up  to  10000  results  in  the  database. 
As  we  can  see,  our  query  performance  does  not  suffer;  as 
long  as  at  least  one  term  in  a  Boolean  is  infrequent  we  will 
perform  well.  The  next  two  groups  are  more  complex  Boolean 
queries  issued  in  disjunctive  normal  form,  the  latter  including 
negations.  The  first  one  includes  queries  with  4-9  terms,  and 
the  second  one,  with  5  terms.  These  incur  a  larger  cost,  as 
the number of results is larger and possibly a bigger part
of  the  tree  is  explored.  As  we  can  see,  MySQL  incurs  a 
proportionally  similar  cost. 

We  note  that  the  relatively  large  variation  shown  in  the  graph 
is  due  to  the  different  queries  used  in  our  test.  Variation  is 
much  smaller  when  we  run  the  same  query  multiple  times. 

Parallelization.  We  have  implemented  a  basic  form  of 
parallelization  in  our  system,  which  enables  it  to  execute 
multiple  queries  concurrently.  As  there  are  no  critical  sections 
or  concurrent  modifications  of  shared  data  structures  during 
querying,  we  saw  the  expected  linear  speedup  when  issuing 
many  queries  up  to  a  point  where  the  CPU  might  not  be 
the bottleneck anymore. In our 16-core system, we achieved
approximately a 6x improvement due to this crude
parallelization.
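The structure of this query-level parallelism can be sketched as follows. This is a minimal illustration, not the system's implementation: the `INDEX` dictionary and the `lookup` function are hypothetical stand-ins for the real search structures, and in CPython threads only illustrate the lock-free structure, since the global interpreter lock limits CPU-bound speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical read-only index: keyword -> tuple of matching record ids.
INDEX = {"alice": (1, 4), "bob": (2,), "carol": (3, 4, 5)}


def lookup(term):
    """Answer one query. INDEX is never mutated, so no locking is needed."""
    return INDEX.get(term, ())


def run_concurrently(terms, workers=4):
    """Execute independent queries on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input terms.
        return list(pool.map(lookup, terms))


answers = run_concurrently(["bob", "dave", "carol"])
```

Because queries share no mutable state, the workers never wait on each other; throughput then scales with the number of workers until some other resource becomes the bottleneck.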

Discussion.  We  note  several  observations  on  our  system, 
performance,  bottlenecks,  etc. 

First, our experiments are run on a fast local network. A
natural question is how performance would translate to a
higher-latency, lower-bandwidth setting. Performance will
degrade in proportion to the bandwidth reduction, with the
following exception. We could use the slightly more
computationally expensive, but much less communication-intensive
GESS protocol of [34] or its recent extension sliced-GESS [35],
instead of Yao’s GC. In reduced-bandwidth settings, where
bandwidth is the bottleneck, sliced-GESS is about 3x more
efficient than the most efficient Yao’s GC. Further, we can
easily scale up the parallelization factor to mitigate latency
increases. Conversely, improving network bandwidth and latency
would make the CPU the bottleneck.

All  search  structures  in  our  system  are  RAM-resident.  Only 
the  record  payloads  are  stored  on  disk.  Thus,  disk  should  not 
be  a  bottleneck  in  natural  scenarios. 

B.  Other  Operations 

Although  querying  is  the  main  operation  of  our  system,  we 
also include some results of other operations. First, we start
with the performance of the setup phase (preprocessing). Blind
Seer took roughly two days to index and encrypt the 10TB
data. As mentioned before, this phase is executed in parallel
and is computationally efficient enough to be I/O-bound in
our  testbed.  We  note  that  the  corresponding  setup  of  MySQL 
took  even  longer. 

Policy  enforcement  was  another  feature  for  which  we 
wanted to measure overhead. However, in our current implementation,
it cannot be disabled (instead, we use a dummy
policy). We experimentally measured the overhead of enforcing
the dummy policy versus more complex ones, but there
was no noticeable difference. Because policy enforcement is an
optional feature, we plan to add the functionality to disable it
entirely and measure its true performance overhead. Our
expectation  is  that  it  will  be  minimal. 

Finally, we performed several measurements for the supported
modification commands: insert, update and delete. All
of them execute in constant time, on the order of a few hundred
microseconds. The more expensive part, though, is the periodic
re-indexing of the data that merges the temporary Bloom filter
list into the tree (see Section V-C). In our current prototype,
we  estimated  this  procedure  to  take  around  17  minutes,  while 
avoiding  re-reading  the  entire  database.  This  can  be  achieved 
by  letting  the  server  store  some  intermediate  indexing  data 
during  the  initial  setup  and  reusing  it  later  when  constructing 
the  Bloom  filter  tree. 

C.  Theoretical  Performance  Analysis 

In  this  section,  we  discuss  the  system  performance  for  various 
queries  by  analyzing  the  number  of  visited  nodes  in  the  search 
tree. Let $\alpha_1, \ldots, \alpha_k$ be $k$ single term queries, and for each
$i \in [k]$, let $r_i$ be the number of returned records for the query
$\alpha_i$, and $n$ be the total number of records.

OR  queries.  Our  system  shows  great  performance  with  OR 
queries. In particular, consider a query $\alpha_1 \vee \cdots \vee \alpha_k$. The
number of visited nodes in the search tree is at most $r \log_{10} n$,
where $r = r_1 + \cdots + r_k$ is the number of returned records.
Therefore,  performance  scales  with  the  size  of  the  result  set, 
just  like  single  term  queries. 
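As a worked instance of this bound, assume a search tree with branching factor 10 (so that its height is $\log_{10} n$) over the 100M-record database used in the evaluation. The three term counts below are hypothetical values chosen only for illustration.

```latex
\begin{align*}
n &= 10^{8} \;\Rightarrow\; \log_{10} n = 8, \\
r &= r_1 + r_2 + r_3 = 1 + 3 + 6 = 10, \\
\text{visited nodes} &\le r \log_{10} n = 10 \cdot 8 = 80 .
\end{align*}
```

The bound depends on the database size only through the tree height, which is why a three-term OR over $10^{8}$ records touches at most tens of nodes in this example.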

AND queries. The performance depends on the best constituent
term. For the AND query $\alpha_1 \wedge \cdots \wedge \alpha_k$, the number of
visited nodes in the search tree is at most
$\min(r_1, \ldots, r_k) \cdot \log_{10} n$. Note that the actual number of
returned records may be much smaller than the $r_i$’s. In the worst
case, it may even be 0; consider a database where a half of the
records contain $\alpha$ (but not $\beta$) and the other half $\beta$ (but not
$\alpha$). The running time for the query $\alpha \wedge \beta$ in this case will
probably be linear
in $n$. However, we stress that this seems to be inherent, even
without  any  security.  Indeed,  without  setting  up  an  index,  every 
algorithm  currently  known  runs  in  linear  time  to  process  this 
query. 
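Both the bound and the worst case above can be reproduced with a small plaintext simulation. The sketch below makes several simplifying assumptions that are ours, not the paper's: ordinary Python sets stand in for Bloom filters, no cryptography or garbled circuits are modeled, and the names `Node`, `build_tree`, and `search_and` are hypothetical. A node is "entered" when its filter contains every query term; a node is "checked" whenever its filter is tested at all.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: frozenset                       # plain set standing in for a Bloom filter
    children: List["Node"] = field(default_factory=list)
    record_id: Optional[int] = None       # set only on leaves


def build_tree(records, branching=10):
    """Build a tree whose internal nodes hold the union of their children's keywords."""
    level = [Node(frozenset(kw), record_id=i) for i, kw in enumerate(records)]
    while len(level) > 1:
        parents = []
        for start in range(0, len(level), branching):
            group = level[start:start + branching]
            union = frozenset().union(*(child.keys for child in group))
            parents.append(Node(union, children=group))
        level = parents
    return level[0]


def search_and(root, terms):
    """Descend only into nodes whose filter contains every term of the AND query."""
    results, stats = [], {"checked": 0, "entered": 0}
    stack = [root]
    while stack:
        node = stack.pop()
        stats["checked"] += 1
        if not all(term in node.keys for term in terms):
            continue                      # prune this whole subtree
        stats["entered"] += 1
        if node.record_id is not None:
            results.append(node.record_id)
        else:
            stack.extend(node.children)
    return sorted(results), stats


n = 1000
# Selective case: one record holds the rare term, every record holds the common one.
selective = [{"common"} for _ in range(n)]
selective[42] = {"common", "rare"}
hits, stats = search_and(build_tree(selective), ["rare", "common"])

# Worst case from the text: half the records hold alpha only, the other half beta only.
worst = [{"alpha"} if i % 2 == 0 else {"beta"} for i in range(n)]
none, worst_stats = search_and(build_tree(worst), ["alpha", "beta"])
```

In the selective case the search enters a single root-to-leaf path of 4 nodes. In the worst case every internal node contains both terms, so all 1000 leaves are checked even though no record matches.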

This  can  be  partially  addressed  by  setting  up  an  index,  in 
our  case  by  using  a  BF.  For  example,  for  AND  queries  on 
two  columns,  for  each  record  with  value  a  for  column  A, 
and  value  b  for  column  B,  the  following  keywords  are  added: 
A:a,  B:b,  AB:a.b.  With  this  approach,  the  indexed  AND 
queries  become  equivalent  to  single  term  queries.  However, 
this  cannot  be  fully  generalized,  as  space  grows  exponentially 
in  the  number  of  search  columns. 
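The composite-keyword idea and its space cost can be illustrated with a short sketch. The function name `composite_keywords` and the `max_arity` parameter are hypothetical; the keyword format follows the A:a, B:b, AB:a.b example above.

```python
from itertools import combinations


def composite_keywords(record, max_arity=2):
    """Return single-column and combined-column keywords for one record.

    `record` maps column names to values; combinations of up to `max_arity`
    columns are indexed so that an AND over them becomes a single-term lookup.
    """
    columns = sorted(record)
    keywords = []
    for arity in range(1, max_arity + 1):
        for combo in combinations(columns, arity):
            name = "".join(combo)
            value = ".".join(str(record[col]) for col in combo)
            keywords.append(f"{name}:{value}")
    return keywords


two_columns = composite_keywords({"A": "a", "B": "b"})

# Indexing every subset of 10 columns yields 2**10 - 1 keywords per record.
ten_columns = composite_keywords({f"C{i}": i for i in range(10)}, max_arity=10)
```

Indexing every non-empty subset of c columns adds 2^c - 1 keywords per record, which is the exponential growth that prevents full generalization.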

Complex queries. The performance of CNF queries can be
analyzed by viewing them as AND queries where each disjunction
(i.e., OR query) is treated as a single term query. In general, any
other complex Boolean query can be converted to CNF and
then analyzed in a similar manner. In other words, performance
scales with the number of results returned by the best disjunction
when the query is represented in CNF. Note that we do not
actually need to convert our queries to this form
(nor know anything about the data, in particular, which
are high- or low-entropy terms) in order to achieve this
performance (this aspect is even better than MySQL).
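Combining the OR and AND bounds gives a quick estimate for a CNF query. The helper below is a sketch under the assumption of a branching factor of 10; the names `tree_height` and `cnf_node_bound`, and the per-term result counts, are hypothetical.

```python
def tree_height(n, base=10):
    """Smallest h with base**h >= n, computed with integers to avoid float error."""
    height, capacity = 0, 1
    while capacity < n:
        capacity *= base
        height += 1
    return height


def cnf_node_bound(clauses, n, base=10):
    """Upper bound on visited nodes for a CNF query.

    `clauses` is a list of OR-clauses, each given as the list of result counts
    r_i of its terms. A clause returns at most sum(r_i) records, and the AND of
    clauses is bounded by its most selective clause.
    """
    best_clause = min(sum(clause) for clause in clauses)
    return best_clause * tree_height(n, base)


# (a1 OR a2) AND a3 AND (a4 OR a5) over 100M records.
bound = cnf_node_bound([[3, 2], [10_000], [500, 40]], n=100_000_000)
```

Here the first clause, with at most 5 matching records, determines the bound regardless of how frequent the terms in the other clauses are.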

Computation  and  Communication.  Both  computational 
and  communication  resources  required  for  our  protocol  are 
proportional  to  the  query  complexities  described  above. 

False  Positives.  As  our  system  is  built  on  Bloom  filters,  false 
positives  are  possible.  In  our  experiments,  we  set  each  BF 
false positive rate to $10^{-6}$. Assuming the worst-case scenario
for us, where the DB is such that many of the search paths do
reach and query the BFs at the leaves, this gives $10^{-6}$ false
positive probability for each term of the query. Of course, the
false positive rate is a tunable parameter of our system.
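To show how this parameter trades space for accuracy, the sketch below applies the textbook Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. The function names and the example inputs (70 keywords per filter, a target rate of one in a million) are assumptions for illustration; the system's actual filter sizes are not stated in this section.

```python
import math


def bloom_parameters(num_items, fp_rate):
    """Return (bits, hashes) for a standard Bloom filter with the target rate."""
    bits = math.ceil(-num_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


def expected_fp_rate(num_items, bits, hashes):
    """Usual approximation of the false positive rate after num_items insertions."""
    return (1 - math.exp(-hashes * num_items / bits)) ** hashes


bits, hashes = bloom_parameters(num_items=70, fp_rate=1e-6)
achieved = expected_fp_rate(70, bits, hashes)
```

Lowering the target rate by a factor of ten costs roughly 4.8 additional bits per stored keyword under these formulas.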

IX.  Related  Work 

The problem of private DBMS can be solved by general-purpose
secure computation schemes [26], [38], [52], [53]. These
solutions,  however,  involve  at  least  linear  (often  much  more) 
work  in  the  database  size,  hence  cannot  be  used  for  practical 
applications  with  large  data.  Oblivious  RAM  (ORAM)  [27] 
can  be  used  to  completely  hide  the  client’s  query  pattern, 
and  can  also  be  used  as  a  tool  to  achieve  sublinear  amortized 
time for secure computation if we allow leaking the program
running  time  [29],  [39].  Nonetheless,  computational  costs  are 
still  prohibitively  high,  rendering  these  solutions  impractical 
for  the  scale  we  are  interested  in. 

Private Information Retrieval protocols (PIR) [16] consider
a scenario where the client wishes to retrieve the ith record of
the server’s data, keeping the server oblivious of the index i.
Symmetric PIR protocols [24] additionally require that the client
should not learn anything more than the requested record.
While most PIR and SPIR protocols support record retrieval
by index selection, Chor et al. [15] considered PIR by keyword.
Although these protocols have sublinear communication
complexity, their computation is polynomial in the number of
records, and inefficient for practical uses.

Another  approach  would  be  to  use  fully  homomorphic 
encryption  (FHE).  In  2009,  Gentry  [21]  showed  that  FHE  is 
theoretically possible. Despite this breakthrough and many follow-up
works, current constructions are too slow for practical
use. For example, it is possible to homomorphically compute
720 AES blocks in two and a half days [23].

Little work has appeared on practical, private search on
large data. In order to achieve efficiency, weaker security
(a small amount of leakage) has been considered. The work
of  [44],  [47]  supports  single  keyword  search  and  conjunctions. 
However,  the  solution  does  not  scale  well  to  databases  with 
a  large  number  of  records  (say  millions);  its  running  time  is 
linear  in  the  number  of  DB  records.  One  of  the  interesting 
features  of  this  work  is  the  way  they  address  range  queries. 
Our  system  also  uses  their  idea  to  achieve  range  queries, 
and  extends  it  to  support  negations  (since  we  use  a  sublinear 
underlying  OR  query,  our  range  queries  are  also  sublinear,  in 
contrast  to  them).  A  more  efficient  solution  towards  this  end 
was  proposed  in  [18].  However,  they  only  considered  single 
keyword  search. 

Any single keyword search solution can be used to solve
(insecurely) arbitrary Boolean formulas: solve each keyword
in the formula separately and then combine (insecurely).
Obviously,  however,  this  leaks  much  more  information  to  the 
parties  (and  also  has  work  proportional  to  the  sum  of  the  work 
for  each  term).  Our  system,  in  contrast,  provides  privacy  of 
the  overall  query  (and  work  proportional  to  just  the  smallest 
term). 
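The cost of this naive combination is easy to see for a conjunction. The sketch below is a plaintext illustration only: `combine_single_term_searches` and the toy `index` are hypothetical, and the "work" counter simply adds up the sizes of the per-term result sets that the combining party gets to see.

```python
def combine_single_term_searches(index, terms):
    """Answer an AND query by running one search per term and intersecting.

    Returns the matching record ids and the total number of per-term results
    fetched, which is both the work done and the information revealed.
    """
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return set(), 0
    work = sum(len(posting) for posting in postings)
    return set.intersection(*postings), work


# Toy index: one record matches the rare term, 10,000 match the common one.
index = {"rare": [7], "common": range(10_000)}
matches, work = combine_single_term_searches(index, ["rare", "common"])
```

The query returns a single record, yet the combining party handles 10,001 per-term results, whereas the bound for the tree-based search depends only on the rarest term.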

There  has  been  a  long  line  of  research  on  searchable 
symmetric  encryption  (SSE)  [11]-[13],  [17],  [25],  [41],  [50]. 
Note  that,  although  many  of  the  techniques  used  in  SSE 
schemes  can  be  used  in  our  scenario,  the  SSE  setting  focuses 
on  data  outsourcing  rather  than  data  sharing.  That  is,  in  SSE 
the  data  owner  is  the  client,  and  so  no  privacy  against  the 
client  is  required.  Additionally,  SSE  solutions  often  offer  either 
a  linear  time  search  over  the  number  of  database  records 
[12],  [41],  [50],  or  a  restricted  type  of  client’s  queries  [17], 
[32],  namely  single  keyword  search  or  conjunctions.  One 
exception is the recent SSE scheme of [11], which extended
the approach of [17] to allow for any Boolean formula of the
form $k_0 \wedge \phi(k_1, \ldots, k_{m-1})$, where $\phi(\cdot)$ is an arbitrary Boolean
formula. Their search time complexity is $O(m \times D(k_0))$, where
$D(k_0)$ is the number of records containing keyword $k_0$. Note
that an arbitrary formula could be represented this way, as
$k_0$ can always be set to true, but then the complexity will
be linear in the number of records. On the other hand, if the
client can format the query so that $k_0$ is a term with very
few matches, the complexity will go down accordingly. In
contrast, our solution addresses arbitrary Boolean formulas,
with complexity proportional to the best term in the CNF
representation of the formula. Searchable encryption has also
been  studied  in  the  public  key  setting  [4],  [6],  [9],  [10],  [49]. 
Here,  many  users  can  use  the  server  public  key  to  encrypt  their 
own  data  and  send  it  to  the  server. 
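The dependence of the [11]-style search cost on the choice of the leading term can be illustrated with a plaintext analogue. This sketch models only the cost structure described above (iterate over the records matching the first term and test the remaining terms on each); it is not the cryptographic protocol of [11], and all names and data in it are hypothetical.

```python
def search_leading_term(records, index, k0, other_terms, phi):
    """Evaluate k0 AND phi(other_terms) by scanning only the records matching k0.

    Returns the matching ids and the number of per-term membership checks,
    which grows as len(other_terms) * D(k0).
    """
    matches, checks = [], 0
    for rid in index.get(k0, ()):
        truth = {term: term in records[rid] for term in other_terms}
        checks += len(other_terms)
        if phi(truth):
            matches.append(rid)
    return matches, checks


# Toy database: every record has "common"; two also have "rare"; one has "x".
records = {i: {"common"} for i in range(1000)}
records[3] = {"common", "rare", "x"}
records[5] = {"common", "rare"}
index = {}
for rid, keywords in records.items():
    for keyword in keywords:
        index.setdefault(keyword, []).append(rid)

# Same query, rare AND common AND NOT x, written with two different leading terms.
good, good_checks = search_leading_term(
    records, index, "rare", ["common", "x"],
    lambda v: v["common"] and not v["x"])
bad, bad_checks = search_leading_term(
    records, index, "common", ["rare", "x"],
    lambda v: v["rare"] and not v["x"])
```

Both orderings return the same record, but the second performs 2000 checks instead of 4, which is the sensitivity to the client's choice of leading term discussed above.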

The  CryptDB  system  [45]  addresses  the  problem  of  DB 
encryption  from  a  completely  different  angle,  and  as  such  is 
largely  incomparable  to  our  work.  CryptDB  does  not  aim  to 
address  the  issue  of  the  privacy  of  the  query  (but  it  does 
achieve  query  privacy  similar  to  the  single-keyword  search 
solution  described  above).  Their  threat  scenario  focuses  on 
DB  data  confidentiality  against  the  curious  DB  administrator, 
and  they  achieve  this  by  using  a  non-private  DBMS  over  what 
they  call  SQL-aware  encrypted  data.  That  is,  the  SQL  query  is 
pre-processed  by  a  fully  trusted  proxy  that  encrypts  the  search 
terms  of  the  query.  The  query  is  then  executed  by  standard 
SQL,  which  combines  the  results  of  individual-term  encrypted 
searches.  Additionally,  for  free-text  search,  CryptDB  uses  the 
linear  solution  of  [50]. 

The closest to our setting/work is a very recent extension
[31] of the SSE solution [11], which additionally
(to  the  SSE  requirements)  addresses  data  privacy  against  the 
client  (and  hence,  as  we  do,  addresses  private  DB).  We  note 
that  the  work  of  [11],  [31]  is  performed  independently  and 
concurrently  to  ours.  [31]  support  the  same  class  of  functions 
as  [11]  (discussed  above).  In  the  worst  case,  such  as  when  the 
client  has  little  a  priori  information  about  the  DB  and  chooses 
a  sub-optimal  term  to  appear  first  in  the  query  term,  the 
complexity  of  the  [31]  solution  can  be  linear  in  the  DB  size. 
In  contrast,  our  solution  for  general  formulas  does  not  depend 
on  the  client’s  knowledge  of  data  distribution  or  representation 
choice  (beyond  the  size  of  the  formula).  However,  we  note  that 
for  typical  practical  applications  this  is  not  a  serious  issue, 
as  the  client  can  represent  his  query  as  a  conjunction,  and 
moreover,  can  make  a  good  guess  for  which  term  will  have 
low  frequency  in  the  data  and  is  a  good  choice  as  the  first 
term.  Thus,  a  large  majority  of  practically  useful  queries  can 
be  evaluated  by  [31]  with  asymptotic  complexity  similar  to 
ours.  In  terms  of  security,  our  guarantees  vary:  [31]  achieves 
security  against  malicious  client,  which  is  much  stronger  than 
our  semi-honest  setting,  and  of  particular  importance  for  the 
policy  enforcement.  Our  leakages  vary  and  are  incomparable. 
We  and  [31]  leak  different  access  pattern  structures  (search 
tree  for  us  and  index  lookups  for  [31]).  Because  we  use  a 
more expensive basic step of SFE, our protection of query-related
data, at least in some cases, is somewhat better. For
example,  depending  on  the  DB  data,  we  may  hide  everything 
about  the  individual  terms  of  the  query,  while  [31]  leak  to 
the  client  and  (their  counterpart  of  the)  IS  the  support  sizes 
for  individual  terms  of  the  disjunctive  queries  (individual  term 
supports  are  revealed  to  the  client,  but  this  is  only  an  issue  if 
the  query  does  not  ask  for  all  the  columns  of  the  records). 

At  the  same  time,  the  concrete  query  performance  of  [31]  is 
somewhat  better  than  ours,  due  to  their  elegant  non-interactive 
approach.  The  very  expensive  step  of  DB  setup  is  faster  for 
us, and the CPU load is lower, as we use mainly symmetric-key
primitives. We also note that our interactive approach
allows significant flexibility. For example, the 0-1 security
(cf. Section V-B) is naturally and cheaply achievable in our
system; it appears harder/more expensive to achieve in a
non-interactive system, and in fact is not considered in [10]. The
use of GC as the basic block similarly provides significant
flexibility and opportunities for feature expansion. Finally, [31]
naturally  support  multiple  clients,  while  our  natural  extensions 
to  multiple  clients  require  that  all  clients  share  a  secret  key 
not  known  to  IS. 

Because of the different trade-offs presented by our work
and  that  of  [31],  each  system  is  better  suited  for  different 
applications/use  cases.  It  is  interesting  to  note  that  these  two 
works,  the  first  ones  to  address  the  major  open  problem  of 
truly  practical,  provably  secure,  and  very  rich  (including  any 
formula)  query  DBMS,  are  based  on  very  different  technical 
approaches.  We  believe  that  this  adds  to  the  value  and  strength 
of  each  of  these  systems.^ 

X.  Discussion  and  Motivation  of  Our  Setting 

Semi-honest model. The semi-honest model is often reasonable
in practice, especially in Government use scenarios. For
example, C, S and the index server may be Government agencies,
whose systems are verified and trusted to execute the prescribed
code. Further, regular audits will help enforce semi-honest
behavior.

Security against malicious adversaries can be added by standard
techniques, but this results in impractical performance.
In follow-up work we show how to amend our protocols to
protect against one malicious player (C or IS) at a very small
cost (ca. 10% increase). This is possible mainly because the
underlying GC protocols are already secure against a malicious
evaluator.

Impact of the allowed leakage. Formally pinning down
exact privacy loss is beyond the reach of state-of-the-art
cryptography, even with no leakage beyond the output and
amount of work (the field of differential privacy is working
on this problem, with very moderate success). Therefore,
understanding our leakage and its impact for specific applications
is crucial to ascertain whether it is acceptable. We
informally  investigated  the  impact  of  leakage  in  several  natural 
applications,  such  as  population  DBs  and  call-record  DBs 
and  query  patterns  (see  example  below);  we  believe  that  our 
protection  is  insufficient  in  some  scenarios,  while  in  many 
others  it  provides  strong  guarantees. 

Rough leakage estimation for call-records DB. Consider
a call-records DB, including columns (Phone number,
Callee phone number, time of call). The client C
is allowed to ask only queries of the form select * where
phone number = xxx AND callee phone number
= yyy AND time of call ∈ {interval}.

For typical call patterns (e.g., 0-10 calls/person/day), the
query leakage will almost always constitute a tree with
branches either going to the leaves (returned records) or truncated
one or two levels from the root. We believe that for many
purposes this is acceptable leakage. Again, we stress that this
is not a formal or detailed analysis (which is beyond the reach
of today’s state-of-the-art); it is included here to support our
belief that our system gives good privacy protection in many
reasonable scenarios.

^We note that in an earlier stage there were two other performers on
the IARPA SPAR program. However, we do not know the details of their
approaches, and are not aware of published work presenting their solutions.

Reliance on the third party. While a two-party solution is
of course preferable, state-of-the-art two-party solutions are orders
of magnitude slower than what is required for scalable DB
access.  Probably  the  most  reasonable  approach  would  be  to 
use  ORAM,  which  is  set  up  either  by  a  trusted  party  or  as 
a  (very  expensive)  2-PC  between  data  owner  and  the  querier. 
Then  the  querier  can  query  the  ORAM  held  by  the  data  owner. 
Due  to  privacy  requirements,  each  ORAM  step  must  be  done 
over  encrypted  data,  which  triggers  performance  that  is  clearly 
unacceptable  for  the  scale  required  in  our  application  (cf.  [29]). 

Further, in Government use cases, employing a third party is
often  seen  as  reasonable.  For  example,  such  a  player  can  be 
run  by  a  neutral  agency.  We  emphasize  that  the  third  party  is 
not  trusted  with  the  data  or  queries,  but  is  trusted  not  to  share 
information  with  the  other  parties. 

XI. Conclusion

Guaranteeing  complete  search  privacy  for  both  the  client  and 
the  server  is  expensive  with  today’s  state  of  the  art.  However, 
a  weaker  level  of  privacy  is  often  acceptable  in  practice, 
especially  as  a  trade-off  for  much  greater  efficiency.  We 
designed,  proved  secure,  built  and  evaluated  a  private  DBMS, 
named Blind Seer, capable of scaling to tens of TBs of data.
This  breakthrough  performance  is  achieved  at  the  expense 
of  leaking  search  tree  traversal  information  to  the  players. 
Our  performance  evaluation  results  clearly  demonstrate  the 
practicality  of  our  system,  especially  on  queries  that  return 
a  few  results  where  the  performance  overhead  over  plaintext 
MySQL was from just 1.2x to 3x slowdown.

We  note  that  the  range  from  complete  privacy  to  best 
performance  is  wide  and  our  work  only  targets  a  specific  point 
within  it.  We  see  it  as  a  step  towards  exploring  several  other 
trade-offs  in  this  space.  Our  goal  for  future  work  is  to  develop 
a  highly  tunable  system  which  will  be  able  to  be  configured 
and  match  many  practical  scenarios  with  different  privacy  and 
performance  requirements. 

Acknowledgments. This work was supported in part by the
Intelligence Advanced Research Project Activity (IARPA) via
Department of Interior National Business Center (DoI/NBC)
contract Number D11PC20194. The U.S. Government is authorized
to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements,
either expressed or implied, of IARPA, DoI/NBC, or the U.S.
Government.

Fernando Krell was supported by BECAS CHILE, CONICYT,
Gobierno de Chile.


This material is based upon work supported by (while author
Keromytis was serving at) the National Science Foundation.
Any opinion, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science
Foundation.

We  thank  MIT  Lincoln  Labs  researchers  for  supporting 
this  program  from  the  beginning  to  the  end  and  facilitating 
extensive  testing  of  our  code. 

Finally, we thank our colleagues from other IARPA SPAR
teams  for  great  collaboration  and  exchange  of  ideas. 

References 

[1] IARPA Security and Privacy Assurance Research (SPAR) program.
http://www.iarpa.gov/Programs/sso/SPAR/spar.html.

[2] The Porter stemming algorithm.
http://tartarus.org/martin/PorterStemmer/.

[3] Privacy groups file lawsuit over license plate scanners.
http://www.therepublic.com/view/story/210d27e7585543a3941f5e577cf7f627/CA--License-Plate-Suit.

[4]  M.  Abdalla,  M.  Bellare,  D.  Catalano,  E.  Kiltz,  T.  Kohno,  T.  Lange, 
J. Malone-Lee, G. Neven, P. Paillier, and H. Shi. Searchable encryption
revisited: Consistency properties, relation to anonymous IBE, and
extensions. J. Cryptol., 21(3):350-391, 2008.

[5]  D.  Beaver.  Precomputing  oblivious  transfer.  In  D.  Coppersmith,  editor, 
CRYPTO’95,  volume  963  of  LNCS,  pages  97-109.  Springer,  Aug.  1995. 

[6]  M.  Bellare,  A.  Boldyreva,  and  A.  O’Neill.  Deterministic  and  efficiently 
searchable  encryption.  In  Proceedings  of  CRYPTO’07,  2007. 

[7]  M.  Bellare,  V.  T.  Hoang,  and  P.  Rogaway.  Foundations  of  garbled 
circuits.  In  T.  Yu,  G.  Danezis,  and  V.  D.  Gligor,  editors,  ACM  CCS 
12,  pages  784-796.  ACM  Press,  Oct.  2012. 

[8]  B.  H.  Bloom.  Space/time  trade-offs  in  hash  coding  with  allowable  errors. 
Commun. ACM, 13(7):422-426, 1970.

[9]  D.  Boneh,  G.  D.  Crescenzo,  R.  Ostrovsky,  and  G.  Persiano.  Public  key 
encryption  with  keyword  search.  In  Proceedings  of  EUROCRYPT’04, 
pages  506-522,  2004. 

[10]  D.  Boneh  and  B.  Waters.  Conjunctive,  subset,  and  range  queries  on 
encrypted  data.  In  S.  P.  Vadhan,  editor,  TCC  2007,  volume  4392  of 
LNCS, pages 535-554. Springer, Feb. 2007.

[11]  D.  Cash,  S.  Jarecki,  C.  S.  Jutla,  H.  Krawczyk,  M.-C.  Rosu,  and 
M.  Steiner.  Highly-scalable  searchable  symmetric  encryption  with 
support  for  boolean  queries.  In  R.  Canetti  and  J.  A.  Garay,  editors, 
CRYPTO  2013,  Part  I,  volume  8042  of  LNCS,  pages  353-373.  Springer, 
Aug.  2013. 

[12]  Y.-C.  Chang  and  M.  Mitzenmacher.  Privacy  preserving  keyword  searches 
on  remote  encrypted  data.  In  ACNS,  volume  3531,  2005. 

[13] M. Chase and S. Kamara. Structured encryption and controlled
disclosure. In M. Abe, editor, ASIACRYPT 2010, volume 6477 of LNCS,
pages 577-594. Springer, Dec. 2010.

[14]  S.  G.  Choi,  J.  Katz,  R.  Kumaresan,  and  H.-S.  Zhou.  On  the  security 
of  the  “free-XOR”  technique.  In  R.  Cramer,  editor,  TCC  2012,  volume 
7194 of LNCS, pages 39-53. Springer, Mar. 2012.

[15]  B.  Chor,  N.  Gilboa,  and  M.  Naor.  Private  information  retrieval  by 
keywords. Technical Report TR-CS0917, Dept. of Computer Science,
Technion,  1997. 

[16]  B.  Chor,  O.  Goldreich,  E.  Kushilevitz,  and  M.  Sudan.  Private  information 
retrieval. J. ACM, 45(6):965-981, 1998.

[17]  R.  Curtmola,  J.  A.  Garay,  S.  Kamara,  and  R.  Ostrovsky.  Searchable 
symmetric  encryption:  improved  definitions  and  efficient  constructions. 
In  ACM  CCS  06,  pages  79-88,  2006. 

[18] E. De Cristofaro, Y. Lu, and G. Tsudik. Efficient techniques for
privacy-preserving sharing of sensitive information. In TRUST’11,
pages 239-253, 2011.

[19]  T.  ElGamal.  A  public  key  cryptosystem  and  a  signature  scheme  based  on 
discrete  logarithms.  IEEE  Transactions  on  Information  Theory,  31:469- 
472,  1985. 

[20]  S.  Even,  O.  Goldreich,  and  A.  Lempel.  A  randomized  protocol  for 
signing  contracts.  In  D.  Chaum,  R.  L.  Rivest,  and  A.  T.  Sherman, 
editors,  CRYPTO’82,  pages  205-210.  Plenum  Press,  New  York,  USA, 
1982. 


[21]  C.  Gentry.  Fully  homomorphic  encryption  using  ideal  lattices.  In 
M.  Mitzenmacher,  editor,  41st  ACM  STOC,  pages  169-178.  ACM  Press, 
May  /  June  2009. 

[22] C. Gentry, K. A. Goldman, S. Halevi, C. Jutla, M. Raykova, and
D. Wichs. Optimizing ORAM and using it efficiently for secure
computation. In Privacy Enhancing Technologies, pages 1-18. Springer,
2013. 

[23]  C.  Gentry,  S.  Halevi,  and  N.  P.  Smart.  Homomorphic  evaluation  of  the 
AES circuit. In R. Safavi-Naini and R. Canetti, editors, CRYPTO 2012,
volume  7417  of  LNCS,  pages  850-867.  Springer,  Aug.  2012. 

[24]  Y.  Gertner,  Y.  Ishai,  E.  Kushilevitz,  and  T.  Malkin.  Protecting  data 
privacy  in  private  information  retrieval  schemes.  Journal  of  Computer 
and  System  Sciences,  60(3):592-629,  2000. 

[25] E.-J. Goh. Secure indexes. IACR Cryptology ePrint Archive, 2003:216,
2003.

[26]  O.  Goldreich,  S.  Micali,  and  A.  Wigderson.  How  to  play  any  mental 
game  or  A  completeness  theorem  for  protocols  with  honest  majority. 
In  A.  Aho,  editor,  19th  ACM  STOC,  pages  218-229.  ACM  Press,  May 
1987. 

[27]  O.  Goldreich  and  R.  Ostrovsky.  Software  protection  and  simulation  on 
oblivious rams. J. ACM, 43:431-473, 1996.

[28]  S.  Goldwasser  and  S.  Micali.  Probabilistic  encryption.  Journal  of 
Computer  and  System  Sciences,  28(2):270-299,  1984. 

[29]  S.  D.  Gordon,  J.  Katz,  V.  Kolesnikov,  E.  Krell,  T.  Malkin,  M.  Raykova, 
and  Y.  Vahlis.  Secure  two-party  computation  in  sublinear  (amortized) 
time.  In  ACM  CCS  12,  pages  513-524,  2012. 

[30]  Y.  Ishai,  J.  Kilian,  K.  Nissim,  and  E.  Petrank.  Extending  oblivious 
transfers  efficiently.  In  D.  Boneh,  editor,  CRYPTO  2003,  volume  2729 
of  LNCS,  pages  145-161.  Springer,  Aug.  2003. 

[31]  S.  Jarecki,  C.  S.  Jutla,  H.  Krawczyk,  M.-C.  Rosu,  and  M.  Steiner. 
Outsourced  symmetric  private  information  retrieval.  In  A.-R.  Sadeghi, 
V.  D.  Gligor,  and  M.  Yung,  editors,  ACM  CCS  13,  pages  875-888.  ACM 
Press,  Nov.  2013. 

[32]  S.  Kamara  and  C.  Papamanthou.  Searching  Dynamic  Encrypted  Data 
in  Parallel.  In  FC  2013,  2013. 

[33]  D.  M.  Kays.  Reasons  to  “friend”  electronic  discovery  law.  Franchise 
Law  Journal,  32(1),  2012. 

[34]  V.  Kolesnikov.  Gate  evaluation  secret  sharing  and  secure  one-round 
two-party  computation.  In  B.  K.  Roy,  editor,  ASIACRYPT  2005,  volume 
3788  of  LNCS,  pages  136-155.  Springer,  Dec.  2005. 

[35] V. Kolesnikov and R. Kumaresan. Improved secure two-party computation
via information-theoretic garbled circuits. In I. Visconti and
R. D. Prisco, editors, SCN 12, volume 7485 of LNCS, pages 205-221.
Springer, Sept. 2012.

[36]  V.  Kolesnikov  and  T.  Schneider.  Improved  garbled  circuit:  Free  XOR 
gates  and  applications.  In  L.  Aceto,  I.  Damgard,  L.  A.  Goldberg,  M.  M. 
Halldorsson,  A.  Ingolfsdottir,  and  I.  Walukiewicz,  editors,  ICALP  2008, 
Part II, volume 5126 of LNCS, pages 486-498. Springer, July 2008.

[37] V. Kolesnikov and T. Schneider. A practical universal circuit construction
and secure evaluation of private functions. In G. Tsudik, editor, FC
2008, volume 5143 of LNCS, pages 83-97. Springer, Jan. 2008.

[38]  Y.  Lindell  and  B.  Pinkas.  A  proof  of  security  of  Yao’s  protocol  for  two- 
party  computation.  Journal  of  Cryptology,  22(2):161-188,  Apr.  2009. 

[39]  S.  Lu  and  R.  Ostrovsky.  Distributed  oblivious  ram  for  secure  two-party 
computation.  In  TCC,  pages  377-396,  2013. 

[40] D. Malkhi, N. Nisan, B. Pinkas, and Y. Sella. Fairplay - secure two-party
computation system. In USENIX Security Symposium, pages 287-302, 2004.

[41]  T.  Moataz  and  A.  Shikfa.  Boolean  symmetric  searchable  encryption. 
In  ASIACCS  2013:  8th  ACM  Symposium  on  Information,  Computer  and 
Communications  Security,  2013. 

[42]  M.  Naor  and  B.  Pinkas.  Computationally  secure  oblivious  transfer. 
Journal of Cryptology, 18(1):1-35, Jan. 2005.

[43]  J.  E.  Pace  III.  Testing  the  security  blanket:  An  analysis  of  recent 
challenges  to  stipulated  blanket  protective  orders.  Antitrust  Magazine, 
19(3),  2005. 

[44]  V.  Pappas,  M.  Raykova,  B.  Vo,  S.  M.  Bellovin,  and  T.  Malkin.  Private 
search in the real world. In ACSAC '11, pages 83-92, 2011.

[45]  R.  A.  Popa,  C.  M.  S.  Redfield,  N.  Zeldovich,  and  H.  Balakrishnan. 
Cryptdb:  protecting  confidentiality  with  encrypted  query  processing.  In 
SOSP '11, pages 85-100. ACM, 2011.

[46]  M.  O.  Rabin.  How  to  exchange  secrets  by  oblivious  transfer.  In  Technical 
Report  TR-81.  Aiken  Computation  Laboratory,  Harvard  University, 
1981. 


[47]  M.  Raykova,  B.  Vo,  S.  Bellovin,  and  T.  Malkin.  Secure  anonymous 
database search. In CCSW 2009, 2009.

[48]  P.  Rogaway.  The  round  complexity  of  secure  protocols.  PhD  thesis, 
Massachusetts  Institute  of  Technology,  1991. 

[49] E. Shi, J. Bethencourt, H. T.-H. Chan, D. X. Song, and A. Perrig.
Multi-dimensional range query over encrypted data. In 2007 IEEE Symposium
on Security and Privacy, pages 350-364. IEEE Computer Society Press,
May 2007.

[50]  D.  X.  Song,  D.  Wagner,  and  A.  Perrig.  Practical  techniques  for  searches 
on  encrypted  data.  In  Proceedings  of  the  2000  IEEE  Symposium  on 
Security  and  Privacy,  SP  ’00,  pages  44-,  Washington,  DC,  USA,  2000. 
IEEE  Computer  Society. 

[51] Y. Huang, D. Evans, J. Katz, and L. Malka. Faster secure two-party
computation  using  garbled  circuits.  In  USENIX  Security  Symposium. 
USENIX  Association,  2011. 

[52]  A.  C.-C.  Yao.  Protocols  for  secure  computations  (extended  abstract).  In 
23rd  FOCS,  pages  160-164.  IEEE  Computer  Society  Press,  Nov.  1982. 

[53]  A.  C.-C.  Yao.  How  to  generate  and  exchange  secrets  (extended  abstract). 
In  27th  FOCS,  pages  162-167.  IEEE  Computer  Society  Press,  Oct.  1986. 

Appendix  A 

Representing  Query  &  Policy 

Encoding  a  query.  In  our  system,  a  query  is  represented 
as  a  Bloom  filter.  This  filter  contains  all  the  relevant  columns 
and  operations,  and  search  terms  and  conditions.  For  example, 
consider  the  following  query: 

SELECT id WHERE fname = ALICE AND dob <= 1975-1-1
AND CONTAINED_IN(notes1, engineer)

The Bloom filter will contain the following:

• fname, fname:=, fname:ALICE, fname:=:ALICE

• dob, dob:<=, dob:1975-1-1, dob:<=:1975-1-1

• notes1, notes1:contained_in, notes1:engineer, notes1:contained_in:engineer
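The keyword pattern in the list above can be sketched as follows. This is only an illustration under stated assumptions: the `Condition` tuple and the `query_keywords` helper are hypothetical names, not part of the system, and the actual insertion of keywords into the Bloom filter is omitted.

```python
from typing import List, Tuple

# Hypothetical representation of one query condition: (column, operator, value).
Condition = Tuple[str, str, str]

def query_keywords(conditions: List[Condition]) -> List[str]:
    """Return the keywords that would be inserted into the query's Bloom filter."""
    keywords: List[str] = []
    for column, op, value in conditions:
        keywords += [
            column,                    # the column is referenced
            f"{column}:{op}",          # the column is used with this operator
            f"{column}:{value}",       # the column is compared against this value
            f"{column}:{op}:{value}",  # the full condition
        ]
    return keywords

# The three conditions of the example query.
example = [
    ("fname", "=", "ALICE"),
    ("dob", "<=", "1975-1-1"),
    ("notes1", "contained_in", "engineer"),
]
for kw in query_keywords(example):
    print(kw)
```

Each condition contributes four keywords, so a policy can later test for a column, an operator, a value, or the full condition independently.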

Policy circuit. The current implementation provides a parser
for any policy that can be represented as a monotone DNF
where each variable indicates whether some policy condition
(BF keyword) belongs to the input BF representing a query as
described above; if the formula output is true, then the client's
query is disallowed. For example, a policy may disallow a
query if it contains an equality check on fname with value
ALICE and a range in dob. In this case, the policy circuit is
a simple formula $V_1$ AND $V_2$, where the variable $V_1$ is true
if the input BF contains fname:=:ALICE, and $V_2$ is true if
the filter contains dob:<=. Indeed, query (2) above will be
disallowed.
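A minimal sketch of evaluating such a monotone DNF policy is given below. It assumes a plain Python set as a stand-in for the Bloom filter (so there are no false positives here), and the names `Policy` and `is_disallowed` are hypothetical. In the actual system the policy is evaluated as a circuit, not in the clear.

```python
from typing import FrozenSet, Iterable, List

# A monotone DNF policy: a list of clauses, each clause a set of BF keywords.
# The formula is true (the query is disallowed) if all keywords of some clause are present.
Policy = List[FrozenSet[str]]

def is_disallowed(policy: Policy, query_filter: Iterable[str]) -> bool:
    present = set(query_filter)  # stand-in for Bloom filter membership tests
    return any(clause <= present for clause in policy)

# Policy from the text: V1 AND V2, with V1 = fname:=:ALICE and V2 = dob:<=.
policy = [frozenset({"fname:=:ALICE", "dob:<="})]

# Keywords of the example query (only the ones relevant here are listed).
query_filter = {"fname", "fname:=", "fname:ALICE", "fname:=:ALICE",
                "dob", "dob:<=", "dob:1975-1-1", "dob:<=:1975-1-1"}

print(is_disallowed(policy, query_filter))              # the example query
print(is_disallowed(policy, {"fname", "fname:=:BOB"}))  # a query outside the policy
```

Because the formula is monotone, a Bloom filter false positive can only turn a variable from false to true, which can cause a false rejection but never a false approval.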

We believe that this provides a wide coverage of policies.
For example, our parser also supports, indirectly, a policy that
allows only range operations on fname. One technical issue is
that we do not want to allow any false approval of a query
that fails the policy (though a tunable small probability of
false rejection of a good query is acceptable). Bloom filters
have false positives but no false negatives, so testing directly
for the absence of a keyword could falsely approve a query.
We can fix this issue by adding keywords representing the
absence of columns or of column operators to the BF. In the
example above the system adds the following keywords:

• NOT:fname:range, NOT:dob:=, NOT:notes1:stem, NOT:lname, NOT:zip, NOT:marital_status, ....


Now, the aforementioned policy is equivalent to one that
disallows queries if the corresponding BF contains fname
and NOT:fname:range. If the check succeeds, then the query
is disallowed. Likewise, a policy allowing only equality operations
on dob will check whether the filter has dob and NOT:dob:=.
In addition, the policy can now disallow queries that do not
contain an equality on the dob column or that do not contain
lname. More importantly, the policy can now enforce that
the query must contain an lname value if fname is present.
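The absence keywords can be sketched as follows. The `schema` and `used` dictionaries, the operator names, and the `absence_keywords` helper are assumptions made for illustration. In particular, the sketch emits only `NOT:column` for a column that is absent from the query, which matches the example list but may differ from what the system does in general.

```python
from typing import Dict, Set

def absence_keywords(schema: Dict[str, Set[str]], used: Dict[str, Set[str]]) -> Set[str]:
    """schema maps each column to the operators it supports (assumed);
    used maps each column appearing in the query to the operators applied to it."""
    keywords: Set[str] = set()
    for column, operators in schema.items():
        if column not in used:
            keywords.add(f"NOT:{column}")       # column absent from the query
            continue
        for op in operators - used[column]:
            keywords.add(f"NOT:{column}:{op}")  # operator absent for a present column
    return keywords

schema = {"fname": {"=", "range"}, "dob": {"=", "range"}, "lname": {"=", "range"}}
used = {"fname": {"="}, "dob": {"range"}}

extra = absence_keywords(schema, used)
bf = {"fname", "dob"} | extra  # set stand-in for the Bloom filter

# "Only range operations on fname" becomes a monotone check:
# disallow if the BF contains both fname and NOT:fname:range.
violates_range_only = {"fname", "NOT:fname:range"} <= bf
print(sorted(extra))
print(violates_range_only)
```

With these keywords every "must contain" or "may only use" rule becomes a test for the presence of keywords, so it still fits the monotone DNF form.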

Appendix  B 

One-Case  Indistinguishability 

Here, we give a formal definition of one-case indistinguishability.
Since our system realizes the ideal functionality, the
definitions concern only input/output behavior and the leakage
profile L.

The distribution E discussed in Section V-B with parameter $\delta$ is
defined as follows:

Let $(D_0, q, r)$ be a database, a query and a record
as specified in Section V-B. Choose a record in $D_0$
uniformly at random and replace it with $r$. Let $D_1$
be the modified database. Choose a bit $b \in \{0, 1\}$
according to the following distribution:

\[ \Pr[b = 1] = \delta, \qquad \Pr[b = 0] = 1 - \delta. \]

Run $E_{db}$, calling Init with $(D_b, P)$, and Query with
$q$. Let $v$ be the leakage to the index server. Output
$(b, v)$.

We show that our system satisfies one-case indistinguishability.
Note that the initial leakage is none, and therefore
we only need to consider the query leakage, which is the
query pattern and the tree search pattern. This implies that
we only need to consider the tree search pattern, since the
same query is considered in the experiment. Observe that the
newly introduced record $r$ is equivalent to adding random
paths in terms of the tree search pattern. Therefore, it suffices
to focus on the number of added random paths. In particular,
let $\mathcal{D}^+$ be defined as follows:

\[ x \leftarrow \mathcal{D}; \quad \text{output } (x + 1). \]


Now, consider the following game $X$:

Choose a bit $b \in \{0, 1\}$ such that $\Pr[b = 1] = \delta$ and
$\Pr[b = 0] = 1 - \delta$. If $b = 0$, let $x \leftarrow \mathcal{D}$; otherwise
let $x \leftarrow \mathcal{D}^+$. Output $(b, x)$.

Now, we show that for any $x$, it holds that

\[ \Pr[b = 1 \mid x] \le 2\delta. \]


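We show this by using case analysis:

• When $x \le 1$, it never comes from $\mathcal{D}^+$, so the inequality
trivially holds.

• When $2 \le x \le \alpha - 1$, it holds that

\[ \Pr[b = 1 \mid x] = \frac{\Pr[X = (1, x)]}{\Pr[x]} = \frac{\delta/\alpha}{\delta/\alpha + (1 - \delta)/\alpha} = \delta \le 2\delta. \]

• When $x \ge \alpha$, it holds that

\[ \Pr[b = 1 \mid x] = \frac{\Pr[X = (1, x)]}{\Pr[x]} = \frac{\delta \cdot (1/\alpha) \cdot 1/2^{x-\alpha}}{\delta \cdot (1/\alpha) \cdot 1/2^{x-\alpha} + (1 - \delta) \cdot (1/\alpha) \cdot 1/2^{x-\alpha+1}} = \frac{\delta}{\delta + (1 - \delta)/2} = \frac{2\delta}{1 + \delta} \le 2\delta. \]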
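The case analysis can be checked numerically with exact arithmetic. The sketch below assumes a concrete form for $\mathcal{D}$ that is consistent with the probabilities used in the cases above (mass $1/\alpha$ on each of $1, \ldots, \alpha - 1$, and mass $(1/\alpha) \cdot 1/2^{x-\alpha+1}$ for $x \ge \alpha$). The function names and the values of `alpha` and `delta` are chosen only for illustration.

```python
from fractions import Fraction

def pr_D(x: int, alpha: int) -> Fraction:
    """Assumed distribution D: uniform on 1..alpha-1, geometric tail from alpha on."""
    if x < 1:
        return Fraction(0)
    if x <= alpha - 1:
        return Fraction(1, alpha)
    return Fraction(1, alpha) / 2 ** (x - alpha + 1)

def pr_D_plus(x: int, alpha: int) -> Fraction:
    """D+ samples x from D and outputs x + 1, so Pr_{D+}[x] = Pr_D[x - 1]."""
    return pr_D(x - 1, alpha)

def posterior(x: int, alpha: int, delta: Fraction) -> Fraction:
    """Pr[b = 1 | x] in game X, computed by Bayes' rule."""
    num = delta * pr_D_plus(x, alpha)
    den = num + (1 - delta) * pr_D(x, alpha)
    return num / den

alpha, delta = 8, Fraction(1, 10)

# The bound Pr[b = 1 | x] <= 2*delta holds for every x in the tested range.
for x in range(1, 3 * alpha):
    assert posterior(x, alpha, delta) <= 2 * delta

# One representative value per case: x <= 1, 2 <= x <= alpha - 1, x >= alpha.
print(posterior(1, alpha, delta), posterior(5, alpha, delta), posterior(12, alpha, delta))
```

Under this assumed $\mathcal{D}$, the three cases give posteriors of $0$, $\delta$, and $2\delta/(1+\delta)$ respectively, so observing $x$ at most doubles the adversary's belief that the record was inserted.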